Computational methods and systems for multidimensional analysis

ABSTRACT

A method for analyzing data obtained from at least one sample in a separation system ( 10, 50, 60 ) that has a capability for separating components of a sample containing more than one component as a function of at least two different variables comprising obtaining data representative of the at least one sample from the system, the data being expressed as a function of the two variables; forming a data stack ( 70, 74, 78, 82, 84 ) having successive levels, each level containing successive data representative of the at least one sample; forming a data array (R) representative of a compilation of all of the data in the data stack; and separating the data array into a series of matrixes. A chemical analysis system that operates in accordance with the method, and a medium having computer readable program code for causing the system to perform the method.

This application is a divisional of U.S. application Ser. No. 10/554,863 filed on Oct. 28, 2005, which is a United States national stage application under 35 U.S.C. 371 of PCT/US2004/013097, filed on Apr. 28, 2004, which in turn claims priority from provisional application Ser. Nos. 60/466,010, 60/466,011 and 60/466,012, all filed on Apr. 28, 2003. This application also claims priority from U.S. application Ser. No. 10/689,313 filed on Oct. 20, 2003. The entire contents of all of these applications are incorporated by reference herein, for all purposes.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to chemical analysis systems. More particularly, it relates to systems that are useful for the analysis of complex mixtures of molecules, including large organic molecules such as proteins, environmental pollutants, and petrochemical compounds, to methods of analysis used therein, and to a computer program product having computer code embodied therein for causing a computer, or a computer and a mass spectrometer in combination, to effect such analysis. Still more particularly, it relates to such systems that have mass spectrometer portions.

2. Prior Art

The race to map the human genome in the past several years has created a new scientific field and industry named genomics, which studies DNA sequences to search for genes and gene mutations that are responsible for genetic diseases through their expressions in messenger RNAs (mRNA) and the subsequent coding of peptides which give rise to proteins. It has been well established in the field that, while the genes are at the root of many diseases including many forms of cancers, the proteins to which these genes translate are the ones that carry out the real biological functions. The identification and quantification of these proteins and their interactions thus serve as the key to the understanding of disease states and the development of new therapeutics. It is therefore not surprising to see the rapid shift in both commercial investment and academic research from genes (genomics) to proteins (proteomics), after the successful completion of the human genome project and the identification of some 35,000 human genes in the summer of 2000. Unlike genomics, which has a more definable end for each species, proteomics is much more open-ended, as any change in gene expression level, environmental factors, and protein-protein interactions can contribute to protein variations. In addition, the genetic makeup of an individual is relatively stable, whereas protein expression can be much more dynamic depending on various disease states and many other factors. In this “post genomics era,” the challenges are to analyze the complex proteins (i.e., the proteome) expressed by an organism in tissues, cells, or other biological samples to aid in the understanding of the complex cellular pathways, networks, and “modules” under various physiological conditions. The identification and quantitation of the proteins expressed in both normal and diseased states play a critical role in the discovery of biomarkers or target proteins.

The challenges presented by the fast-developing field of proteomics have brought an impressive array of highly sophisticated scientific instrumentation to bear, from sample preparation, sample separation, imaging, and isotope labeling to mass spectral detection. Large data arrays of higher and higher dimensions are being routinely generated in both industry and academia around the world in the race to reap the fruits of genomics and proteomics. Due to the complexities and the sheer number of proteins (easily reaching into thousands) typically involved in proteomics studies, complicated, lengthy, and painstaking physical separations are performed in order to identify and sometimes quantify individual proteins in a complex sample. These physical separations create tremendous challenges for sample handling and information tracking, not to mention the days, weeks, and even months it typically takes to fully elucidate the content of a single sample.

While there are only about 35,000 genes in the human genome, there are an estimated 500,000 to 2,000,000 proteins in the human proteome that could be studied both for the general population and for individuals under treatment or other clinical conditions. A typical sample taken from cells, blood, or urine, for example, usually contains up to several thousand different proteins in vastly different abundances. Over the past decade, the industry has popularized a process that includes multiple stages in order to analyze the many proteins existing in a sample. This process is summarized in Table 1 with the following notable features:

TABLE 1. A Typical Proteomics Process: Time, Cost, and Informatics Needs

Steps / Proteomics Process

Sample collection
    Isolate proteins from biological samples such as blood, tissue, urine, etc.
    Instrument cost: minimal; Time: 1-3 hours
    Mostly liquid phase sample
    Need to track sample source/preparation conditions

Gel separation
    Separate proteins spatially through gel electrophoresis to generate up to several thousand protein spots
    Instrument cost: $150 K; Time: 24 hours
    Liquid into solid phase
    Need to track protein separation conditions and gel calibration information

Imaging and spot cutting
    Image, analyze, identify protein spots on the gel with MW/pI calibration, and spot cutting
    Instrument cost: $150 K; Time: 30 sec/spot
    Solid phase
    Track protein spot images, image processing parameters, gel calibration parameters, molecular weights (MW) and pI's, and cutting records

Protein digestion
    Chemically break down proteins into peptides
    Instrument cost: $50 K; Time: 3 hours
    Solid to liquid phase
    Track digestion chemistry & reaction conditions

Protein spotting or sample preparation
    Mix each digested sample with mass spectral matrix, spot on sample targets, and dry (MALDI), or sample preparation for LC/MS(/MS)
    Instrument cost: $50 K; Time: 30 sec/spot
    Liquid to solid phase
    Track volumes & concentrations for samples/reagents

Mass spectral analysis
    Measure peptide(s) in each gel spot directly (MALDI) or via LC/MS(/MS)
    Instrument: $200 K-650 K; Time: 1-10 sec/spot on MALDI or 30 min/spot on LC/MS(/MS)
    Solid phase on MALDI or liquid phase on LC/MS(/MS)
    Track mass spectrometer operation, analysis, and peak processing parameters

Protein database search
    Search private/public protein databases to identify proteins based on unique peptides
    Instrument cost: minimal; Time: 1-60 sec/spot

Summary
    Instrument cost: $600 K-$1 M
    Time/sample: several days minimum

a. It could take up to several days or weeks or even months to complete the analysis of a single sample.

b. The bulky hardware system costs $600,000 to $1M with significant operating (labor and consumables), maintenance, and lab space cost associated with it.

c. This is an extremely tedious and complex process that includes several different robots and a few different types of instruments to essentially separate one liquid sample into hundreds to thousands of individual solid spots, each of which needs to be analyzed one-at-a-time through another cycle of solid-liquid-solid chemical processing.

d. It is not a small challenge to integrate these pieces/steps together for a rapidly changing industry, and as a result, there is not yet a commercial system that fully integrates and automates all these steps. Consequently, this process is fraught with human as well as machine errors.

e. This process also calls for sample and data tracking from all the steps along the way, which is not a small challenge even for today's informatics.

f. Even for a fully automated process with a complete sample and data tracking informatics system, it is not clear how these data ought to be managed, navigated, and most importantly, analyzed.

g. At this early stage of proteomics, many researchers are content with qualitative identification of proteins. The holy grail of proteomics is, however, both identification and quantification, which would open doors to exciting applications not only in the area of biomarker identification for the purpose of drug discovery but also for clinical diagnostics, as evidenced by the intense interest generated from a recent publication (Petricoin, E. F. III et al., Lancet, Vol. 359, pp. 573-77, (2002)) on using protein profiles from blood samples for ovarian cancer diagnostics. The current process cannot be easily adapted for quantitative analysis due to protein loss, sample contamination, or lack of gel solubility, although attempts have been made at quantitative proteomics with the use of complex chemical processes such as ICAT (isotope-coded affinity tags), a general approach to quantitation wherein proteins or protein digests from two different sample sources are labeled by a pair of isotope atoms, and subsequently mixed in one mass spectrometry analysis (Gygi, S. P. et al. Nat. Biotechnol. 17, 994-999 (1999)).

Isotope-coded affinity tags (ICAT) is a commercialized version of the approach introduced recently by Applied Biosystems of Foster City, Calif. In this technique, proteins from two different cell pools are labeled with regular reagent (light) and deuterium substituted reagent (heavy), and combined into one mixture. After trypsin digestion, the combined digest mixtures are subjected to separation by biotin-affinity chromatography to result in a cysteine-containing peptide mixture. This mixture is further separated by reverse phase HPLC and analyzed by data dependent mass spectrometry followed by database search.

This method significantly simplifies a complex peptide mixture into a cysteine-containing peptide mixture and allows simultaneous protein identification by SEQUEST database search and quantitation by the ratio of light peptides to heavy peptides. Similar to LC/LC/MS/MS, ICAT also circumvents the insolubility problem, since both techniques digest the whole protein mixture into peptide fragments before separation and analysis.

While very powerful, the ICAT technique requires a multi-step labeling and pre-separation process, resulting in the loss of low abundance proteins, adding reagent cost, and further reducing the throughput of the already slow proteomic analysis. Since only cysteine-containing peptides are analyzed, the sequence coverage is typically quite low with ICAT. As is the case in a typical LC/MS/MS experiment, the protein identification is achieved through a limited number of MS/MS analyses on hopefully signature peptides, resulting in only one and at most a few labeled peptides for ratio quantitation.

Liquid chromatography interfaced with tandem mass spectrometry (LC/MS/MS) has become a method of choice for protein sequencing (Yates Jr. et al., Anal. Chem. 67, 1426-1436 (1995)). This method involves a few processes including digestion of proteins, LC separation of peptide mixtures generated from the protein digests, MS/MS analysis of the resulting peptides, and database search for protein identification. The key to effectively identifying proteins with LC/MS/MS is to produce as many high quality MS/MS spectra as possible to allow for reliable matching during database search. This is achieved by a data-dependent scanning technique in a quadrupole or an ion trap instrument. With this technique, the mass spectrometer checks the intensities and signal to noise ratios of the most abundant ion(s) in a full scan MS spectrum and performs MS/MS experiments when the intensities and signal to noise ratios of the most abundant ions exceed a preset threshold. Usually the three most abundant ions are selected for the product ion scans to maximize the sequence information and minimize the time required, as the selection of more than three ions for MS/MS experiments would possibly result in missing other qualified peptides currently eluting from the LC to the mass spectrometer.
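The data-dependent selection rule described above can be sketched as follows. This is a minimal illustration only, not the firmware logic of any particular instrument: the function name, the peak-list layout, the threshold values, and the example scan are all hypothetical.

```python
from typing import List, Tuple

# Each peak is (m/z, intensity, signal-to-noise ratio); this layout is assumed.
Peak = Tuple[float, float, float]


def select_precursors(peaks: List[Peak],
                      min_intensity: float,
                      min_snr: float,
                      max_ions: int = 3) -> List[float]:
    """Return the m/z values of up to max_ions peaks, most intense first,
    whose intensity and signal-to-noise ratio both exceed the thresholds."""
    qualified = [p for p in peaks if p[1] > min_intensity and p[2] > min_snr]
    qualified.sort(key=lambda p: p[1], reverse=True)
    return [mz for mz, _, _ in qualified[:max_ions]]


# Hypothetical full-scan peak list (all values invented for illustration).
full_scan = [
    (445.1, 1200.0, 15.0),
    (512.3, 9000.0, 80.0),
    (623.8, 300.0, 4.0),    # fails both thresholds
    (701.4, 4500.0, 40.0),
    (815.9, 2500.0, 22.0),
    (930.2, 5000.0, 2.0),   # intense, but signal-to-noise too low
]

selected = select_precursors(full_scan, min_intensity=1000.0, min_snr=10.0)
print(selected)
```

In this example four ions qualify but only three are passed on for product ion scans, so the ion at m/z 445.1 goes unanalyzed, which is the coverage limitation at issue.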

The success of LC/MS/MS for identification of proteins is largely due to its many outstanding analytical characteristics. Firstly, it is quite a robust technique with excellent reproducibility. It has been demonstrated to be reliable for high throughput LC/MS/MS analysis for protein identification. Secondly, when using nanospray ionization, the technique delivers quality MS/MS spectra of peptides at sub-femtomole levels. Thirdly, the MS/MS spectra carry sequence information of both C-terminal and N-terminal ions. This valuable information can be used not only for identification of proteins, but also for pinpointing what post translational modifications (PTM) have occurred to the protein and at which amino acid residue the PTM takes place.

For the total protein digest from an organism, a cell line, or a tissue type, LC/MS/MS alone is not sufficient to produce a sufficient number of good quality MS/MS spectra for the identification of the proteins. Therefore, LC/MS/MS is usually employed to analyze digests of a single protein or a simple mixture of proteins, such as the proteins separated by two dimensional electrophoresis (2DE), adding a minimum of a few days to the total analysis time, to the instrument and equipment cost, and to the complexity of sample handling and the informatics need for sample tracking. While a full MS scan can and typically does contain rich information about the sample, the current LC/MS/MS methodology relies on the MS/MS analysis that can be afforded for only a few ions in the full MS scan. Moreover, electrospray ionization (ESI) used in LC/MS/MS has less tolerance towards salt concentrations from the sample, requiring rigorous sample clean up steps.

Identification of the proteins in an organism, a cell line, and a tissue type is an extremely challenging task, due to the sheer number of proteins in these systems (estimated at thousands or tens of thousands). The development of LC/LC/MS/MS technology (Link, A. J. et al. Nat. Biotechnol. 17, 676-682 (1999); Washburn, M. P. et al, Nat. Biotechnol. 19, 242-247 (2001)) is one attempt to meet this challenge by going after one extra dimension of LC separation. This approach begins with the digestion of the whole protein mixture and employs a strong cation exchange (SCX) LC to separate protein digests by a stepped gradient of salt concentrations. This separation usually takes 10-20 steps to turn an extremely complex protein mixture into a relatively simplified mixture. The mixtures eluted from the SCX column are further introduced into a reverse phase LC and subsequently analyzed by mass spectrometry. This method has been demonstrated to identify a large number of proteins from yeast and the microsome of human myeloid leukemia cells.

One of the obvious advantages of this technique is that it avoids insolubility problems in 2DE, as all the proteins are digested into peptide fragments which are usually much more soluble than proteins. As a result, more proteins can be detected and wider dynamic range achieved with LC/LC/MS/MS. Another advantage is that chromatographic resolution increases tremendously through the extensive 2D LC separation so that more high quality MS/MS spectra of peptides can be generated for more complete and reliable protein identification. The third advantage is that this approach is readily automated within the framework of current LC/MS system for potentially high throughput proteomic analysis.

The extensive 2D LC separation in LC/LC/MS/MS, however, could take 1-2 days to complete. In addition, this technique alone is not able to provide quantitative information on the proteins identified, and a quantitative scheme such as ICAT would require extra time and effort with sample loss and extra complications. In spite of the extensive 2D LC separation, there are still a significant number of peptide ions not selected for MS/MS experiments due to the time constraint between the MS/MS data acquisition and the continuous LC elution, resulting in low sequence coverage (25% coverage is already considered very good). While recent development in depositing LC traces onto a solid support for later MS/MS analysis can potentially address the limited MS/MS coverage issue, it would introduce significantly more sample handling and protein loss and further complicate the sample tracking and information management tasks.

Matrix-Assisted Laser Desorption Ionization (MALDI) utilizes a focused laser beam to irradiate the target sample that is co-crystallized with a matrix compound on a conductive sample plate. The ionized molecules are usually detected by a time of flight (TOF) mass spectrometer, due to their shared characteristics as pulsed techniques.

MALDI/TOF is commonly used to detect 2DE separated intact proteins because of its excellent speed, high sensitivity, wide mass range, high resolution, and tolerance of contaminants. MALDI/TOF with capabilities of delayed extraction and reflecting ion optics can achieve impressive mass accuracy at 1-10 ppm and mass resolution with m/Δm at 10000-15000 for the accurate analysis of peptides. However, the lack of MS/MS capability in MALDI/TOF is one of the major limitations for its use in proteomics applications. Post Source Decay (PSD) in MALDI/TOF does generate sequence-like MS/MS information for peptides, but the operation of PSD often is not as robust as that of a triple quadrupole or an ion trap mass spectrometer. Furthermore, PSD data acquisition is difficult to automate as it can be peptide-dependent.

The newly developed MALDI TOF/TOF system (Rejtar, T. et al., J. Proteome Res. 1(2) 171-179 (2002)) delivers many attractive features. The system consists of two TOFs and a collision cell, which is similar to the configuration of a tandem quadrupole system. The first TOF is used to select precursor ions that undergo collision-induced dissociation (CID) in the cell to generate fragment ions.

Subsequently, the fragment ions are detected by the second TOF. One of the attractive features is that TOF/TOF is able to perform as many data dependent MS/MS experiments as necessary, while a typical LC/MS/MS system selects only a few abundant ions for the experiments. This unique development makes it possible for TOF/TOF to perform industry scale proteomic analysis. The proposed solution is to collect fractions from 2D LC experiments and spot the fractions onto a MALDI plate for MS/MS. As a result, more MS/MS spectra can be acquired for more reliable protein identification by database search, as the quality of MS/MS spectra generated by high-energy CID in TOF/TOF is far better than PSD spectra.

The major drawbacks of this approach are the high cost of the instrument ($750,000), the lengthy 2D separations, the sample handling complexities with LC fractions, the cumbersome sample preparation processes for MALDI, the intrinsic difficulty in quantification with MALDI, and the huge informatics challenges for data and sample tracking. Due to the LC separation and the sample preparation time required, the analysis of several hundred proteins in one sample would take at least 2 days.

It is well recognized that Fourier-Transform Ion-Cyclotron Resonance (FTICR) MS is a powerful technique that can deliver high sensitivity, high mass resolution, wide mass range, and high mass accuracy. Recently, FTICR/MS coupled with LC showed impressive capabilities for proteomic analysis through Accurate Mass Tags (AMT) (Smith, R. D. et al, Proteomics, 2, 513-523 (2002)). An AMT is an m/z value of a peptide that is accurate enough to be used to exclusively identify a protein. It has been demonstrated that, using the AMT approach, a single LC/FTICR-MS analysis can potentially identify more than 10⁵ proteins with mass accuracy of better than 1 ppm. Nonetheless, AMT alone may not be sufficient to pinpoint amino acid residue specific post-translational modifications of peptides. In addition, the instrument is prohibitively expensive at a cost of $750K or more with high maintenance requirements.

Protein arrays and protein chips are emerging technologies (Issaq, H. J. et al, Biochem Biophys Res Commun. 292(3), 587-592 (2002)) similar in design concept to the oligonucleotide-chip used in gene expression profiling. Protein arrays consist of protein chips which contain chemically (cationic, anionic, hydrophobic, hydrophilic, etc.) or biochemically (antibody, receptor, DNA, etc.) treated surfaces for specific interaction with the proteins of interest. These technologies take advantage of the specificity provided by affinity chemistry and the high sensitivity of MALDI/TOF and offer high throughput detection of proteins. In a typical protein array experiment, a large number of protein samples can be simultaneously applied to an array of chips treated with specific surface chemistries. By washing away undesired chemical and biomolecular background, the proteins of interest are docked on the chips due to affinity capturing and hence “purified”. Further analysis of individual chips by MALDI-TOF results in the protein profiles in the samples. These technologies are ideal for the investigation of protein-protein interactions, since proteins can be used as affinity reagents to treat the surface to monitor their interaction with other specific proteins. Another useful application of these technologies is to generate comparative patterns between normal and diseased tissue samples as a potential tool for disease diagnostics.

Due to the complicated surface chemistries involved and the added complications with proteins or other protein-like binding agents such as denaturing, folding, and solubility issues, protein arrays and chips are not expected to have as wide an application as gene chips or gene expression arrays.

Thus, the past 100 years have witnessed tremendous strides made on the MS instrumentation with many different types of instruments designed and built for high throughput, high resolution, and high sensitivity work. The instrumentation has been developed to a stage where single ion detection can be routinely accomplished on most commercial MS systems with unit mass resolution allowing for the observation of ion fragments coming from different isotopes. In stark contrast to the sophistication in hardware, very little has been done to systematically and effectively analyze the massive amount of MS data generated by modern MS instrumentation.

In a typical mass spectrometer, the user is usually required to supply, or is supplied with, a standard material having several fragment ions covering the mass spectral m/z range of interest. Subject to baseline effects, isotope interferences, mass resolution, and resolution dependence on mass, peak positions of a few ion fragments are determined either in terms of centroids or peak maxima through a low order polynomial fit at the peak top. These peak positions are then fit to the known peak positions for these ions through either a first order or a higher order polynomial fit to calibrate the mass (m/z) axis.
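A first order version of this conventional calibration can be sketched as an ordinary least-squares line relating measured peak positions to known positions. This is an illustrative sketch of the prior-art procedure only; the function names and the peak positions below are invented for the example and do not come from any real standard.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical calibration standard: measured vs. known peak positions (m/z).
measured = [100.0, 200.0, 300.0, 400.0]
known = [99.95, 199.85, 299.75, 399.65]

slope, intercept = fit_line(measured, known)


def calibrate(mz_measured):
    """Map a measured m/z value onto the calibrated mass axis."""
    return slope * mz_measured + intercept


print(slope, intercept, calibrate(250.0))
```

Because only a few peak positions enter the fit, any error in locating those peaks carries directly into the calibrated axis.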

After the mass axis calibration, a typical mass spectral data trace would then be subjected to peak analysis where peaks (ions) are identified. This peak detection routine is a highly empirical and compounded process where peak shoulders, noise in data trace, baselines due to chemical backgrounds or contamination, isotope peak interferences, etc., are considered.

For the peaks identified, a process called centroiding is typically applied to attempt to calculate the integrated peak areas and peak positions. Due to the many interfering factors outlined above and the intrinsic difficulties in determining peak areas in the presence of other peaks and/or baselines, this is a process plagued by many adjustable parameters that can make an isotope peak appear or disappear with no objective measures of the centroiding quality.
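A bare-bones form of centroiding is sketched below, assuming a trapezoidal area and an intensity-weighted mean position; commercial implementations differ and add many tunable steps. The data points and the sloping baseline are invented to show how an uncorrected baseline biases both the area and the centroid.

```python
def centroid_peak(mz, intensity):
    """Return (area, centroid) for one profile-mode peak.

    area:     trapezoidal integral of intensity over m/z
    centroid: intensity-weighted mean m/z
    """
    area = sum(0.5 * (intensity[i] + intensity[i + 1]) * (mz[i + 1] - mz[i])
               for i in range(len(mz) - 1))
    centroid = sum(m * y for m, y in zip(mz, intensity)) / sum(intensity)
    return area, centroid


# Hypothetical symmetric peak centered at m/z 500.0.
mz = [499.8, 499.9, 500.0, 500.1, 500.2]
clean = [0.0, 5.0, 10.0, 5.0, 0.0]

# The same peak sitting on an invented sloping baseline.
baseline = [0.0, 1.0, 2.0, 3.0, 4.0]
with_baseline = [c + b for c, b in zip(clean, baseline)]

clean_area, clean_center = centroid_peak(mz, clean)
biased_area, biased_center = centroid_peak(mz, with_baseline)
print(clean_area, clean_center)
print(biased_area, biased_center)
```

The baseline inflates the area and pulls the centroid toward higher m/z, a systematic error of the kind that no setting of the adjustable parameters removes in general.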

Thus, despite their apparent sophistication, current approaches have several pronounced disadvantages. These include:

Lack of Mass Accuracy. The mass calibration currently in use usually does not provide better than 0.1 amu (m/z unit) in mass determination accuracy on a conventional MS system with unit mass resolution (ability to visualize the presence or absence of a significant isotope peak).

In order to achieve higher mass accuracy and reduce ambiguity in molecular fingerprinting such as peptide mapping for protein identification, one has to switch to an MS system with higher resolution such as quadrupole TOF (qTOF) or FT ICR MS, which come at significantly higher cost.

Large Peak Integration Error. Due to the contribution of mass spectral peak shape, its variability, the isotope peaks, the baseline and other background signals, and the random noise, current peak area integration has large errors (both systematic and random errors) for either strong or weak mass spectral peaks.

Difficulties with Isotope Peaks. Current approaches do not have a good way to separate the contributions from various isotopes, which usually give partially overlapped mass spectral peaks on conventional MS systems with unit mass resolution. The empirical approaches used either ignore the contributions from neighboring isotope peaks or over-estimate them, resulting in errors for dominating isotope peaks and large biases for weak isotope peaks, or even in the weaker peaks being missed entirely. When ions of multiple charges are concerned, the situation becomes even worse, due to the reduced separation in mass units between neighboring isotope peaks.
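The reduced separation for multiply charged ions follows from dividing the isotope mass difference by the charge. The short sketch below assumes that neighboring isotope peaks are separated by roughly the 13C/12C mass difference, which is a simplification; the function and constant names are illustrative.

```python
# Approximate mass difference between 13C and 12C, in Da.
C13_MINUS_C12 = 1.003355


def isotope_spacing(charge: int) -> float:
    """Approximate m/z separation between neighboring isotope peaks
    for an ion carrying the given number of charges."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return C13_MINUS_C12 / charge


for z in (1, 2, 3):
    print(z, isotope_spacing(z))
```

On an instrument with unit mass resolution, the spacing of about one half m/z unit for doubly charged ions and one third for triply charged ions leaves neighboring isotope peaks heavily overlapped.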

Nonlinear Operation. The current approaches use a multi-stage disjointed process with many empirically adjustable parameters during each stage. Systematic errors (biases) are generated at each stage and propagated down to the later stages in an uncontrolled, unpredictable, and nonlinear manner, making it impossible for the algorithms to report meaningful statistics as measures of data processing quality and reliability.

Dominating Systematic Errors. In most MS applications, ranging from industrial process control and environmental monitoring to protein identification or biomarker discovery, instrument sensitivity or detection limit has always been a focus and great efforts have been made in many instrument systems to minimize measurement error or noise contribution in the signal. Unfortunately, the peak processing approaches currently in use create a source of systematic error even larger than the random noise in the raw data, thus becoming the limiting factor in instrument sensitivity or reliability.

Mathematical and Statistical Inconsistency. The many empirical approaches used currently make the whole mass spectral peak processing inconsistent either mathematically or statistically. The peak processing results can change dramatically on slightly different data without any random noise or on the same synthetic data with slightly different noise. In other words, the results of the peak processing are not robust and can be unstable depending on the particular experiment or data collection.

Instrument-To-Instrument Variations. It has usually been difficult to directly compare raw mass spectral data from different MS instruments due to variations in the mechanical, electromagnetic, or environmental tolerances. The current ad hoc peak processing applied to the raw data only adds to the difficulty of quantitatively comparing results from different MS instruments. On the other hand, the need for comparing either raw mass spectral data directly or peak processing results from different instruments or different types of instruments has been increasingly heightened for the purpose of impurity detection or protein identification through searches in established MS libraries.

A second order instrument generates a matrix of data for each sample and can have a higher analytical power than first order instruments if the data matrix is properly structured. The most widely used proteomics instrument, LC/MS, is a typical example of a second order instrument capable of potentially much higher analytical power than what is currently achieved. Other second order proteomics instruments include LC/LC with single UV wavelength detection, 1D gel with MALDI-TOF MS detection, 1D protein arrays with MALDI MS detection, etc.

Two-dimensional gel electrophoresis (2D gel) has been widely used in the separation of proteins in complex biological samples such as cells or urine. Typically the spots formed by the proteins are stained with silver for easy identification with visible imaging systems.

These spots are subsequently excised, dissolved/digested with enzymes, transported onto MALDI targets, dried, and analyzed for peptide signatures using a MALDI time-of-flight mass spectrometer.

Several complications arise from this process:

1. The protein spots are not guaranteed to contain only single proteins, especially at extreme ends of the separation parameters (pI for charge or MW for molecular weight). This usually makes peptide searching difficult if not impossible. Additional liquid chromatography separation may be required for each excised spot, which further slows down the analysis.

2. The conversion of biological sample from liquid phase to solid phase (on the gel), back into liquid phase (for digestion), and finally into solid phase again (for MALDI TOF analysis) is a very cumbersome process prone to errors, carry-overs, and contaminations.

3. Due to the sample conversion processes involved and the irreproducibility of MALDI-TOF in sampling and ionization, this analysis has been widely recognized as only qualitative and not quantitative.

Thus, in spite of their tremendous potential and clear advantages over first and zeroth order analysis, second order instruments and analysis have so far been limited to academic research where the sample is composed of a few synthetic analytes, with no sign of commercialization. There are several barriers that must be crossed in order for this approach to reach its huge potential. These include:

a. In second order protein analysis, it is even more important to use raw profile MS scans instead of the centroid data currently used in virtually all MS applications. To maintain the bilinear data structure, successive MS scans of a particular ion eluting from LC need to have the same mass spectral peak shape (obviously at different peak heights), a critical second order structure destroyed by centroiding and de-isotoping (summing all isotope peaks into one integrated area). The sticks from centroiding data appear at different mass locations (up to 0.5 amu error) from successive MS scans of the same ion.

b. Higher order instruments and analysis require a more robust instrument and measurement process, and artifacts such as shifts in one or two of the dimensions can severely compromise the quantitative and even the qualitative results of the analysis (Wang, Y. et al, Anal. Chem. 63, 2750 (1991); Wang, Y. et al, Anal. Chem., 65, 1174 (1993); Kiers, H. A. L. et al, J. Chemometrics 13, 275 (1999)), in spite of the recent progress made in academia (Bro, R. et al, J. Chemometrics 13, 295 (1999)). Other artifacts such as non-linearity or non-bilinearity could also lead to complications (Wang, Y. et al, J. Chemometrics, 7, 439 (1993)). Standardization and algorithmic corrections need to be developed in order to maintain the bilinearity of second order proteomics data.

c. In many MS instruments such as quadrupole MS, the mass spectral scan time is not negligible compared to the protein or peptide elution time. Therefore, a significant skew would exist where the ions measured in one mass spectral scan come from different time points during the LC elution, similar to what has been reported for GC/MS (Stein, S. E. et al, J. Am. Soc. Mass Spectrom. 5, 859 (1994)).
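The bilinear structure referred to in item a can be illustrated with a small sketch. For a single ion, the scans-by-mass matrix of profile data is the outer product of an elution profile and a fixed peak shape, and therefore has rank one. The elution heights, peak shape, and stick positions below are invented for illustration, and the rank test is a simple check written for this example rather than a method taken from the text.

```python
def outer(u, v):
    """Outer product of two vectors as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]


def is_rank_one(matrix, tol=1e-9):
    """True if all 2x2 minors through the largest entry vanish,
    i.e. every row is a multiple of one common row."""
    rows, cols = len(matrix), len(matrix[0])
    pi, pj = max(((i, j) for i in range(rows) for j in range(cols)),
                 key=lambda ij: abs(matrix[ij[0]][ij[1]]))
    pivot = matrix[pi][pj]
    if pivot == 0:
        return False
    for i in range(rows):
        for j in range(cols):
            minor = matrix[i][j] * pivot - matrix[i][pj] * matrix[pi][j]
            if abs(minor) > tol:
                return False
    return True


# Hypothetical single ion: three successive scans, five mass channels.
elution = [0.2, 1.0, 0.5]                 # peak height in each scan
peak_shape = [0.0, 0.5, 1.0, 0.5, 0.0]    # same shape in every scan
profile = outer(elution, peak_shape)

# Centroided version: one stick per scan with the same integrated signal,
# landing in a different mass channel each time (invented jitter).
centroided = [
    [0.0, 0.4, 0.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
]

print(is_rank_one(profile), is_rank_one(centroided))
```

The profile matrix passes the rank-one check while the centroided matrix does not, which is the sense in which stick positions that wander between scans destroy the second order structure.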

Thus, there exists a significant gap between where proteomics research would like to be and where it is at present.

SUMMARY OF THE INVENTION

It is an object of the invention to provide a chemical analysis system, which may include a mass spectrometer, and a method for operating a chemical analysis system, that overcome the disadvantages described above.

It is another object of the invention to provide a storage medium having thereon computer readable program code for causing a chemical analysis system, including a chemical analysis system having a mass spectrometer, to perform the method in accordance with the invention.

These objects and others are achieved in accordance with a first aspect of the invention by using 2D gel imaging data acquired from intact proteins to perform both qualitative and quantitative analysis without the use of a mass spectrometer in the presence of protein spot overlaps. In addition, the invention facilitates direct quantitative comparisons between many different samples collected over a wider population range (diseased and healthy), over a period of time on the same population (development of disease), and over different treatment methods (response to potential treatment), etc. The gel spot alignment and matching are automatically built into the data analysis to yield the best overall results. The approach in accordance with the invention represents a fast, inexpensive, quantitative, and qualitative tool for both protein identification and protein expression analysis.

Generally, the invention is directed to a method for analyzing data obtained from at least one sample in a separation system that has a capability for separating components of a sample containing more than one component as a function of at least two different variables, the method comprising obtaining data representative of the at least one sample from the system, the data being expressed as a function of the two variables; forming a data stack having successive levels, each level containing successive data representative of the at least one sample; forming a data array representative of a compilation of all of the data in the data stack; and separating the data array into a series of matrixes, the matrixes being: a concentration matrix representative of concentration of each component in the sample; a first profile of the components as a function of a first of the variables; and a second profile of the components as a function of a second of the variables. There may be only a single sample, in which case the successive data is representative of the sample as a function of time. Successive data may be representative of the single sample as a function of mass of its components. Alternatively, there may be a plurality of samples, and the successive data is then representative of successive samples.

The invention is more specifically directed to a method for analyzing data obtained from multiple samples in a separation system that has a capability for separating components of a sample containing more than one component as a function of two different variables, the method comprising obtaining data representative of multiple samples from the system, the data being expressed as a function of the two variables; forming a data stack having successive levels, each level containing one of the data samples; forming a data array representative of a compilation of all of the data in the data stack; and separating the data array into a series of matrixes, the matrixes being: a concentration matrix representative of concentration of each component in the sample; a first profile of the components as a function of the first variable; and a second profile of the components as a function of the second variable. The first profile and the second profile are representative of profiles of substantially pure components. The method further comprises performing qualitative analysis using at least one of the first profile and the second profile.

The method may further comprise standardizing data representative of a sample by performing a data matrix multiplication of such data into the product of a first standardization matrix, the data itself, and a second standardization matrix, to form a standardized data matrix. Terms in the first standardization matrix and the second standardization matrix may have values that cause the data to be represented at positions with respect to the two variables, which are different in the standardized data matrix from those in the data array. The first standardization matrix shifts the data with respect to the first variable, and the second standardization matrix shifts the data with respect to the second variable. Terms in the first standardization matrix and the second standardization matrix have values that serve to standardize distribution shapes of the data with respect to the first and second variable, respectively. Terms in the first standardization matrix and the second standardization matrix may be determined by applying a sample having known components to the apparatus; and selecting terms for the first standardization matrix and the second standardization matrix which cause data produced by the known components to be positioned properly with respect to the first variable and the second variable. The terms may be determined by selecting terms which produce a smallest error in position of the data with respect to the first variable and the second variable in the standardized data matrix. The terms of the first standardization matrix and the second standardization matrix are preferably computed for each sample, and so as to produce a smallest error over all samples. At least one of the first and second standardization matrices can be simplified to be either a diagonal matrix or an identity matrix. The terms in the first standardization matrix and the second standardization matrix may be based on parameterized known functional dependence of the terms on the variables.

Values of terms in the first standardization matrix and the second standardization matrix are determined by solving the data array R:

where Q (m×k) contains pure profiles of all k components with respect to the first variable, W (n×k) contains pure profiles with respect to the second variable for the components, C (p×k) contains concentrations of these components in all p samples, I is a new data array with scalars on its super-diagonal as the only nonzero elements, and E (m×n×p) is a residual data array.
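
The model these definitions describe can be sketched element-wise as follows. This is a reading of the definitions above, assuming a standard trilinear form; the element indices a, b and j, the lower-case element symbols, and the symbol $\iota_{i}$ for the ith super-diagonal scalar of I are notation introduced here for illustration only.

$r_{abj} = \sum\limits_{i = 1}^{k} \iota_{i}\, q_{ai}\, w_{bi}\, c_{ji} + e_{abj}, \quad a = 1,\ldots,m; \; b = 1,\ldots,n; \; j = 1,\ldots,p$

Under this reading, each of the p levels of R is a sum of k rank-one terms, one per component, scaled by that component's concentration in sample j.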

The separation apparatus may be a two-dimensional electrophoresis separation system, wherein the first variable is isoelectric point and the second variable is molecular weight.

The variables may be a result of any combination, in no particular sequence, and including self-combination, of chromatographic separation, capillary electrophoresis separation, gel-based separation, affinity separation and antibody separation.

One of the two variables may be mass associated with the mass axis of a mass spectrometer.

The apparatus may further comprise a chromatography system for providing the samples to the mass spectrometer, retention time being another of the two variables.

The apparatus may further comprise an electrophoresis separation system for providing the samples to the mass spectrometer, migration characteristics of the sample being another of the two variables.

In the method, the data is preferably continuum mass spectral data. Preferably, the data is used without centroiding. The data may be corrected for time skew. Preferably, a calibration of the data with respect to mass and mass spectral peak shapes is performed.

One of the first variable and the second variable may be that of a region on a protein chip having a plurality of protein affinity regions.

The method may further comprise obtaining data for the data array by using a single channel analyzer and by analyzing the samples successively. The single channel analyzer may be based on one of light absorption, light emission, light reflection, light transmission, light scattering, refractive index, electrochemistry, conductivity, radioactivity, or any combination thereof. The components in the sample may be bound to at least one of fluorescence tags, isotope tags, stains, affinity tags, or antibody tags.

The invention is also directed to a computer readable medium having thereon computer readable code for use with a chemical analysis system having a data analysis portion for analyzing data obtained from multiple samples, the chemical analysis system having a separation portion that has a capability for separating components of a sample containing more than one component as a function of two different variables, the computer readable code being for causing the computer to perform a method comprising obtaining data representative of multiple samples from the system, the data being expressed as a function of the two variables; forming a data stack having successive levels, each level containing one of the data samples; forming a data array representative of a compilation of all of the data in the data stack; and separating the data array into a series of matrixes, the matrixes being: a concentration matrix representative of concentration of each component in the sample; a first profile of the components as a function of the first variable; and a second profile of the components as a function of the second variable. The computer readable medium may further comprise computer readable code for causing the computer to analyze data by performing the steps of any one of the methods stated above.

The invention is further directed to a chemical analysis system for analyzing data obtained from multiple samples, the system having a separation system that has a capability for separating components of a sample containing more than one component as a function of two different variables, the system having apparatus for performing a method comprising obtaining data representative of multiple samples from the system, the data being expressed as a function of the two variables; forming a data stack having successive levels, each level containing one of the data samples; forming a data array representative of a compilation of all of the data in the data stack; and separating the data array into a series of matrixes, the matrixes being: a concentration matrix representative of concentration of each component in the sample; a first profile of the components as a function of the first variable; and a second profile of the components as a function of the second variable. The chemical analysis system may have facilities for performing the steps of any of the methods described above.

The invention further includes a method for analyzing data obtained from a sample in a separation system that has a capability for separating components of a sample containing more than one component, the method comprising separating the sample with respect to at least a first variable to form a separated sample; separating the separated sample with respect to at least a second variable to form a further separated sample; obtaining data representative of the further separated sample from a multi-channel analyzer, the data being expressed as a function of three variables; forming a data stack having successive levels, each level containing data from one channel of the multi-channel analyzer; forming a data array representative of a compilation of all of the data in the data stack; and separating the data array into a series of matrixes or arrays, the matrixes or arrays being: a concentration data array representative of concentration of each component in the sample on its super-diagonal; a first profile of each component as a function of a first variable; a second profile of each component as a function of a second variable; and a third profile of each component as a function of a third variable. The first profile, the second profile, and the third profile are representative of profiles of substantially pure components. The method further comprises performing qualitative analysis using at least one of the first profile, the second profile, and the third profile.

The method further comprises standardizing data representative of a sample by performing a data matrix multiplication of such data into the product of a first standardization matrix, the data itself, and a second standardization matrix, to form a standardized data matrix. Terms in the first standardization matrix and the second standardization matrix have values that cause the data to be represented at positions with respect to two of the three variables, which are different in the standardized data matrix from those in the data array. The first standardization matrix shifts the data with respect to one of the two variables, and the second standardization matrix shifts the data with respect to the other of the two variables. Terms in the first standardization matrix and the second standardization matrix may have values that serve to standardize distribution shapes of the data with respect to the two variables, respectively. Terms in the first standardization matrix and the second standardization matrix are determined by applying a sample having known components to the apparatus; and selecting terms for the first standardization matrix and the second standardization matrix which cause data produced by the known components to be positioned properly with respect to the two variables.

The terms are determined by selecting terms that produce a smallest error in position of the data with respect to the two variables, in the standardized data matrix. The terms of the first standardization matrix and the second standardization matrix may be computed for a single channel. The terms of the first standardization matrix and the second standardization matrix are computed so as to produce a smallest error for the channel.

At least one of the first and second standardization matrices can be simplified to be either a diagonal matrix or an identity matrix. Preferably, the terms in the first standardization matrix and the second standardization matrix are based on parameterized known functional dependence of the terms on the variables.

In accordance with the invention, the values of terms in the first standardization matrix and in the second standardization matrix are determined by solving data array R:

where Q (m×k) contains pure profiles of all k components with respect to the first variable, W (n×k) contains pure profiles with respect to the second variable for the components, C (p×k) contains pure profiles of these components with respect to the multichannel detector or the third variable, I (k×k×k) is a new data array with scalars on its super-diagonal as the only nonzero elements representing the concentrations of all the k components, and E (m×n×p) is a residual data array.
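
As a minimal sketch of the model structure just described, the following assembles a three-way array from the factors Q, W and C and a super-diagonal core holding the concentrations. The function name `trilinear_model`, the use of NumPy, and the random example factors are illustrative assumptions. The sketch only rebuilds an array from known factors (with a zero residual E); it does not perform the decomposition or estimate the standardization matrices.

```python
import numpy as np


def trilinear_model(Q, W, C, conc):
    """Build an (m, n, p) array from factor matrices and a super-diagonal core.

    Q is (m, k), W is (n, k), C is (p, k); conc holds the k super-diagonal
    scalars of the core array I. All names here are hypothetical.
    """
    k = len(conc)
    core = np.zeros((k, k, k))
    idx = np.arange(k)
    core[idx, idx, idx] = conc  # only nonzero elements: the super-diagonal
    # Multiply the core by Q, W and C along its first, second and third modes.
    return np.einsum("abc,ia,jb,lc->ijl", core, Q, W, C)


rng = np.random.default_rng(0)
m, n, p, k = 6, 5, 4, 2
Q = rng.random((m, k))
W = rng.random((n, k))
C = rng.random((p, k))
conc = np.array([1.5, 0.5])

R = trilinear_model(Q, W, C, conc)
```

Because the core is nonzero only on its super-diagonal, the result equals a sum of k rank-one terms, one per component, which is what makes the pure profiles recoverable in principle.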

The separation apparatus used may be a one-dimensional electrophoresis separation system, wherein the variable is one of isoelectric point and molecular weight.

The two separation variables may be a result of any combination, in no particular sequence, and including self-combination, of chromatographic separation, capillary electrophoresis separation, gel-based separation, affinity separation and antibody separation.

One of the three variables may be mass associated with the mass axis of a mass spectrometer.

The apparatus used may comprise at least one chromatography system for providing the separated samples to the mass spectrometer, retention time being at least one of the variables. The apparatus may also comprise at least one electrophoresis separation system for providing the separated samples to the mass spectrometer, migration characteristics of the sample being at least one of the variables. Preferably, the data is continuum mass spectral data. Preferably, the data is used without centroiding.

The method may further comprise correcting the data for time skew. The method also may further comprise performing a calibration of the data with respect to mass and spectral peak shapes.

The apparatus used may comprise a protein chip having a plurality of protein affinity regions, location of a region being one of the three variables.

The multi-channel analyzer used may be based on one of light absorption, light emission, light reflection, light transmission, light scattering, refractive index, electrochemistry, conductivity, radioactivity, or any combination thereof. The components in the sample may be bound to at least one of fluorescence tags, isotope tags, stains, affinity tags, or antibody tags.

The apparatus used may comprise a two-dimensional electrophoresis separation system, wherein a first of the at least one variable is isoelectric point and a second of the at least one variable is molecular weight.

The invention is also directed to a computer readable medium having thereon computer readable code for use with a chemical analysis system having a data analysis portion for analyzing data obtained from a sample, the chemical analysis system having a separation portion that has a capability for separating components of a sample containing more than one component as a function of at least one variable, the computer readable code being for causing the computer to perform a method comprising separating the sample with respect to at least a first variable to form a separated sample; separating the separated sample with respect to at least a second variable to form a further separated sample; obtaining data representative of the further separated sample from a multi-channel analyzer, the data being expressed as a function of three variables; forming a data stack having successive levels, each level containing data from one channel of the multi-channel analyzer; forming a data array representative of a compilation of all of the data in the data stack; and separating the data array into a series of matrixes or arrays, the matrixes or arrays being: a concentration data array representative of concentration of each component in the sample on its super-diagonal; a first profile of each component as a function of a first variable; a second profile of each component as a function of a second variable; and a third profile of each component as a function of a third variable. The computer readable medium may further comprise computer readable code for causing the computer to analyze data by performing the steps of any of the methods set forth above.

The invention is also directed to a chemical analysis system for analyzing data obtained from a sample, the system having a separation system that has a capability for separating components of a sample containing more than one component as a function of at least one variable, the system having apparatus for performing a method comprising separating the sample with respect to at least a first variable to form a separated sample; separating the separated sample with respect to at least a second variable to form a further separated sample; obtaining data representative of the further separated sample from a multi-channel analyzer, the data being expressed as a function of three variables; forming a data stack having successive levels, each level containing data from one channel of the multi-channel analyzer; forming a data array representative of a compilation of all of the data in the data stack; and separating the data array into a series of matrixes or arrays, the matrixes or arrays being: a concentration data array representative of concentration of each component in the sample on its super-diagonal; a first profile of each component as a function of a first variable; a second profile of each component as a function of a second variable; and a third profile of each component as a function of a third variable. The chemical analysis system may further comprise facilities for performing the steps of the methods described above.

BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing aspects and other features of the present invention are explained in the following description, taken in connection with the accompanying drawings, wherein like numerals indicate like components, and wherein:

FIG. 1 is a block diagram of an analysis system in accordance with the invention, including a mass spectrometer.

FIG. 2 is a block diagram of a system having one dimensional sample separation, and a multi-channel detector.

FIG. 3 is a block diagram of a system having two dimensional sample separation, and a single channel detector.

FIG. 4A, FIG. 4B and FIG. 4C illustrate the compilation of three-dimensional data arrays based on two-dimensional measurements, in accordance with the invention.

FIG. 5 illustrates a three dimensional data array based on single three-dimensional measurements with one sample.

FIG. 6 illustrates a three-dimensional data array based on two-dimensional liquid phase separation followed by mass spectral detection.

FIG. 7 illustrates time skew correction for multi-channel detection with sequential scanning.

FIG. 8 is a flow chart of a method of analysis in accordance with the invention.

FIG. 9 illustrates a transformation for automatic alignment of separation axes and corresponding profiles, in accordance with the invention.

FIG. 10 illustrates direct decomposition of a three-dimensional data array.

FIG. 11 illustrates grouping of peptides (a dendrogram) resulting from enzymatic digestion into proteins through cluster analysis, in accordance with the invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

Referring to FIG. 1, there is shown a block diagram of an analysis system 10 that may be used to analyze proteins or other molecules, as noted above, incorporating features of the present invention. Although the present invention will be described with reference to the single embodiment shown in the drawings, it should be understood that the present invention can be embodied in many alternate forms. In addition, any suitable types of components could be used.

Analysis system 10 has a sample preparation portion 12, a mass spectrometer portion 14, a data analysis system 16, and a computer system 18. The sample preparation portion 12 may include a sample introduction unit 20, of the type that introduces a sample containing molecules of interest to system 10, such as the Finnigan LCQ Deca XP Max, manufactured by Thermo Electron Corporation of Waltham, Mass., USA. The sample preparation portion 12 may also include an analyte separation unit 22, which is used to perform a preliminary separation of analytes, such as the proteins to be analyzed by system 10. Analyte separation unit 22 may be either a chromatography column or a gel separation unit, such as is manufactured by Bio-Rad Laboratories, Inc. of Hercules, Calif., and is well known in the art. In general, a voltage or pH gradient is applied to the gel to cause the molecules such as proteins to be separated as a function of one variable, such as migration speed through a capillary tube (molecular weight, MW) or isoelectric focusing point (Hanash, S. M., Electrophoresis 21, 1202-1209 (2000)), for one dimensional separation, or by more than one of these variables, such as by isoelectric focusing and by MW (two dimensional separation). An example of the latter is known as SDS-PAGE.

The mass spectrometer portion 14 may be a conventional mass spectrometer and may be any one available, but is preferably one of MALDI-TOF, quadrupole MS, ion trap MS, or FTICR-MS. If it has a MALDI or electrospray ionization ion source, such ion source may also provide for sample input to the mass spectrometer portion 14. In general, mass spectrometer portion 14 may include an ion source 24, a mass spectrum analyzer 26 for separating ions generated by ion source 24 by mass to charge ratio (or simply called mass), an ion detector portion 28 for detecting the ions from mass spectrum analyzer 26, and a vacuum system 30 for maintaining a sufficient vacuum for mass spectrometer portion 14 to operate efficiently. If mass spectrometer portion 14 is an ion mobility spectrometer, generally no vacuum system is needed.

The data analysis system 16 includes a data acquisition portion 32, which may include one or a series of analog to digital converters (not shown) for converting signals from ion detector portion 28 into digital data. This digital data is provided to a real time data processing portion 34, which processes the digital data through operations such as summing and/or averaging. A post processing portion 36 may be used to do additional processing of the data from real time data processing portion 34, including library searches, data storage and data reporting.

Computer system 18 provides control of sample preparation portion 12, mass spectrometer portion 14, and data analysis system 16, in the manner described below. Computer system 18 may have a conventional computer monitor 40 to allow for the entry of data on appropriate screen displays, and for the display of the results of the analyses performed. Computer system 18 may be based on any appropriate personal computer, operating for example with a Windows® or UNIX® operating system, or any other appropriate operating system. Computer system 18 will typically have a hard drive 42, on which the operating system and the program for performing the data analysis described below are stored. A drive 44 for accepting a CD or floppy disk is used to load the program in accordance with the invention onto computer system 18. The program for controlling sample preparation portion 12 and mass spectrometer portion 14 will typically be downloaded as firmware for these portions of system 10. Data analysis system 16 may be a program written to implement the processing steps discussed below, in any of several programming languages such as C++, JAVA or Visual Basic.

FIG. 2 is a block diagram of an analysis system 50 wherein the sample preparation portion 12 includes a sample introduction unit 20 and a one dimensional sample separation apparatus 52. By way of example, apparatus 52 may be a one dimensional electrophoresis apparatus. Separated sample components are analyzed by a multi-channel detection apparatus 54, such as, for example, a series of ultraviolet sensors, or a mass spectrometer. The manner in which data analysis may be conducted is discussed below.

FIG. 3 is a block diagram of an analysis system 60, wherein the sample preparation portion 12 includes a sample introduction unit 20, a first dimension sample separation apparatus 62 and a second dimension sample separation apparatus 64. By way of example, first dimension sample separation apparatus 62 and second dimension sample separation apparatus 64 may be two successive and different liquid chromatography units, or may be consolidated as a two-dimensional electrophoresis apparatus. Separated sample components are analyzed by a single channel detection apparatus 66, such as, for example, an ultraviolet sensor with a 245 nm bandpass filter, or a gray scale gel imager. Again, the manner in which data analysis may be conducted is discussed below.

FIG. 4A illustrates a three-dimensional data array 70 compiled from a series of two-dimensional arrays 72A to 72N, representative of successive samples of a mixture of components to be analyzed. Two dimensional data arrays 72A to 72N may be produced by, for example, two dimensional gel electrophoresis, or successive chromatographic separations, as described above with respect to FIG. 3, or the combination of other separation techniques.

FIG. 4B illustrates a three-dimensional data array 74 compiled from a series of two-dimensional arrays 76A to 76N, representative of successive samples of a mixture of components to be analyzed. Two dimensional data arrays 76A to 76N may be produced by, for example, one dimensional gel electrophoresis, or liquid chromatography, followed by multi-channel analysis, as described above with respect to FIG. 2, or by other techniques such as gas chromatography/infrared spectroscopy (GC/IR) or LC/Fluorescence.

FIG. 4C illustrates a three-dimensional data array 78 compiled from a series of two-dimensional arrays 80A to 80N, representative of successive samples of a mixture of components to be analyzed. Two dimensional data arrays 80A to 80N are produced by, for example, protein affinity chips of the type sold by Ciphergen Biosystems, Inc. of Fremont, Calif., USA, which are able to selectively bind proteins to defined regions (spots) on their surfaces, followed by multi-channel analysis, such as Surface Enhanced Laser Desorption/Ionization (SELDI) time of flight mass spectrometry, which may be one of the systems described above with respect to FIG. 2. Another technique which may be used is a 1D protein array combined with multi-channel fluorescence detection.

FIG. 5 illustrates a three-dimensional data array 82 compiled from a series of two-dimensional arrays 84A to 84N, representative of a single sample of a mixture of components to be analyzed. Two dimensional data arrays 84A to 84N may be produced by, for example, two-dimensional gel electrophoresis, or successive liquid chromatography, as described above with respect to FIG. 1. Multi-channel detection by, for example, mass spectrometry, as described above with respect to FIG. 1, produces data in the third dimension. Other suitable techniques are 2D LC with multi-channel UV or fluorescence detection, 2D LC with IR detection, and 2D protein array with mass spectrometry.

FIG. 6 illustrates a data array 84 obtained by two-dimensional liquid phase separation (for example strong cation exchange chromatography followed by reversed phase chromatography). The third dimension is represented by the data along a mass axis 86 from mass spectral detection.

The data arrays of FIGS. 4A, 4B, 4C, 5 and 6 contain terms representative of all components in all of the samples or of a single sample, as the case may be (including the components of any calibration standards).

FIG. 7 illustrates correction for time skew of a scanning multi-channel detector connected to a time-based separation, as is the case in LC/MS where the LC is connected to a mass spectrometer which sweeps through a certain mass range during a predetermined scanning time.

This type of time skew exists for most mass spectrometers, with the exception of simultaneous systems such as a magnetic sector system which detects ions of all masses simultaneously. Other examples include GC/IR, where volatile compounds are separated in terms of retention time after passing through a column while the IR spectrum is being acquired through either a scanning monochromator or an interferometer. When a time-dependent event such as a separation or reaction is connected to a detection system that sequentially scans through multiple channels, a time skew is generated where channels scanned earlier correspond to an earlier point in time for the event whereas the channels scanned later would correspond to a later point in time for the event. This time skew can be corrected by way of interpolation on a channel-by-channel basis to generate multi-channel data that correspond to the same point in time for all channels, i.e., to interpolate for each channel from the solid tilted lines onto the corresponding dashed horizontal lines in FIG. 7.
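
The channel-by-channel interpolation described above can be sketched as follows. This is an illustration under stated assumptions, not the patented procedure: the function name `correct_time_skew` is hypothetical, scans are assumed to run back to back with a fixed period, every channel is assumed to receive an equal share of that period, and linear interpolation (`numpy.interp`) is one simple choice of interpolant.

```python
import numpy as np


def correct_time_skew(data, scan_period):
    """Interpolate each channel onto the start time of its scan.

    data has shape (n_scans, n_channels). Channel ch of scan s is assumed
    to have been acquired at s * scan_period + ch * scan_period / n_channels.
    """
    n_scans, n_channels = data.shape
    scan_start = np.arange(n_scans) * scan_period
    dwell = scan_period / n_channels
    corrected = np.empty_like(data, dtype=float)
    for ch in range(n_channels):
        acquired_at = scan_start + ch * dwell
        # np.interp holds the first sample constant for times before it,
        # so the first scan of later channels is left unchanged.
        corrected[:, ch] = np.interp(scan_start, acquired_at, data[:, ch])
    return corrected


# Synthetic signal that is linear in time, with a per-channel offset.
scan_period = 1.0
n_scans, n_channels = 5, 4
t = (np.arange(n_scans)[:, None] * scan_period
     + np.arange(n_channels)[None, :] * scan_period / n_channels)
skewed = 2.0 * t + np.arange(n_channels)
aligned = correct_time_skew(skewed, scan_period)
```

After correction, every channel in a given row of `aligned` refers to the same point in time, which is the condition the bilinear model relies on.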

FIG. 8 is a general flow chart of how sample data is acquired and processed in accordance with the invention. Collection and processing of samples, such as biological samples, is performed at 100. If a single sample is being processed, three-dimensional data is acquired at 102. If two-dimensional data is to be acquired with multiple samples at 106, an internal standard is optionally added to the sample at 104. As described with respect to any of the techniques and systems above, a three-dimensional data array is formed at 108. The three-dimensional data array undergoes direct decomposition at 110. Different paths are selected at 112 based on whether or not a two-dimensional measurement has been made. If two-dimensional measurements have been made, pure analyte profiles in each dimension are obtained at 114 along with their relative concentrations across all samples. If three-dimensional measurements have been made on a single sample, pure analyte profiles for all analytes in the sample along all three dimensions are obtained at 116. In either case, data interpretation, including analyte grouping, cluster analysis and other types of expression and analysis, is conducted at 118 and the results are reported out on display 40 of computer system 18, associated with a system of one of FIG. 1, 2 or 3.

The modes of analysis of the data are described below, with respect to specific examples, which are provided in order to facilitate understanding of, but not by way of limitation to, the scope of the invention.

The response matrix, R_(j) (m×n), for a typical sample can be expressed in the following bilinear form:

$R_{j} = {\sum\limits_{i = 1}^{k}{c_{i}x_{i}y_{i}^{T}}}$

where c_(i) is the concentration of the ith analyte, x_(i) (m×1) is the response of this analyte along the row axis (e.g., LC elution profile or chromatogram of this analyte in LC/MS), y_(i) (n×1) is the response of this analyte along the column axis (e.g., MS spectrum of this analyte in LC/MS), and k is the number of analytes in the sample. When the response matrices of multiple samples (j=1, 2, . . . , p) are compiled, a 3D data array R (m×n×p) can be formed.
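As a small numerical illustration of the bilinear form and of the compilation into a 3D data array, the sketch below builds synthetic response matrices and stacks them. The sizes, the random profiles and the helper name `response_matrix` are assumptions made purely for the example.

```python
import numpy as np

def response_matrix(c, X, Y):
    """R_j = sum_i c[i] * outer(X[:, i], Y[:, i]), i.e. X @ diag(c) @ Y.T."""
    return (X * c) @ Y.T

m, n, k, p = 6, 5, 2, 3          # hypothetical sizes
rng = np.random.default_rng(0)
X = rng.random((m, k))           # row-axis responses, one column per analyte
Y = rng.random((n, k))           # column-axis responses, one column per analyte
conc = rng.random((p, k))        # concentrations, one row per sample

# Compile the p response matrices into an m x n x p data array.
R = np.stack([response_matrix(conc[j], X, Y) for j in range(p)], axis=2)
```

Each slice `R[:, :, j]` has rank at most k, which is the property the decomposition discussed later relies on.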

Thus, at the end of a 2D gel run, a gray-scale image can be generated and represented in a 2D matrix R_(j) (dimensioned m by n, corresponding to m different pI values digitized into rows and n different MW values digitized into columns, for sample j). This raw image data needs to be calibrated in both pI and MW axes to yield a standardized image $\underline{R}_{j}$,

$\underline{R}_{j} = A_{j} R_{j} B_{j}$

where A_(j) is a square matrix dimensioned as m by m with nonzero elements along and around the main diagonal (a banded diagonal matrix) and B_(j) is another square matrix (n by n) with nonzero elements along and around the main diagonal (another banded diagonal matrix). The matrices A_(j) and B_(j) can be as simple as diagonal matrices (representing simple linear scaling) or as complex as increasing or decreasing bandwidths along the main diagonals (correcting for at least one of band shift, broadening, and distortion or other types of non-linearity). A graphical representation of the above equation in its general form can be given as illustrated in FIG. 9:

$\underset{\underset{\_}{R_{j}}}{\begin{bmatrix}\# & \ldots & \# \\\ldots & \; & \ldots \\\ldots & \ldots & \ldots \\\ldots & \; & \ldots \\\# & \ldots & \#\end{bmatrix}} = {\underset{\underset{\_}{A_{j}}}{\begin{bmatrix}\# & \; & \; & \; & \; \\\# & \# & \; & \; & \; \\\; & \# & O & \# & \; \\\; & \; & O & \# & \; \\\; & \; & \; & \# & \#\end{bmatrix}}\underset{\underset{\_}{R_{j}}}{\begin{bmatrix}\# & \ldots & \# \\\ldots & \; & \ldots \\\ldots & \ldots & \ldots \\\ldots & \; & \ldots \\\# & \ldots & \#\end{bmatrix}}\underset{\underset{\_}{B_{j}}}{\begin{bmatrix}\# & \; & \; & \; & \; \\\# & \# & \; & \; & \; \\\; & \# & O & \# & \; \\\; & \; & O & \# & \; \\\; & \; & \; & \# & \#\end{bmatrix}}}$

When 2-D gel data from multiple samples are collected, a set of standardized images $\underline{R}_{j}$ can be arranged to form a 3D data array R as

$R = \begin{bmatrix}{\underset{\_}{R}}_{1} \\\ldots \\\ldots \\\ldots \\{\underset{\_}{R}}_{p}\end{bmatrix}$

where p is the number of biological samples and with R dimensioned as m by n by p. This data array (in the shape of a cube or rectangular solid) can be decomposed with a trilinear decomposition method based on GRAM (Generalized Rank Annihilation Method, direct decomposition through matrix operations without iteration, Sanchez, E. et al, J. Chemometrics 4, 29 (1990)) or PARAFAC (PARAllel FACtor analysis, iterative decomposition with alternating least squares, Carroll, J. et al, Psychometrika 3, 45 (1980); Bezemer, E. et al, Anal. Chem. 73, 4403 (2001)) into four different arrays and a residual data array E:

where C represents the relative concentrations of all identifiable proteins (k of them with k≦min(m,n)) in all p samples, Q represents the pI profiles digitized at m pI values for each protein (k of them), W represents the molecular weight profiles digitized at n values for each protein (ideally a single peak will be observed that corresponds to each protein), and I is a new data cube with scalars on its super-diagonal as the only nonzero elements.
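One way to read the decomposition is element by element: the entry of R at pI index a, MW index b and sample j is modelled as the sum over i of I[i,i,i]·Q[a,i]·W[b,i]·C[j,i], plus the corresponding entry of E. The sketch below is a minimal alternating-least-squares (PARAFAC-style) fit of that model on synthetic data. It is an illustration only: the function names, the synthetic factors and the fixed iteration count are assumptions, the super-diagonal scalars of I are absorbed into C for simplicity, and GRAM (the non-iterative alternative named above) is not shown.

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product; row index is a * B.shape[0] + b."""
    return np.einsum('ai,bi->abi', A, B).reshape(-1, A.shape[1])

def parafac_als(R, k, n_iter=500, seed=0):
    """Fit R[a, b, j] ~ sum_i Q[a, i] * W[b, i] * C[j, i] by alternating least squares."""
    m, n, p = R.shape
    rng = np.random.default_rng(seed)
    Q, W, C = (rng.random((d, k)) for d in (m, n, p))
    # Unfoldings of R with the mode being updated along the rows.
    R_q = R.reshape(m, n * p)                       # columns indexed by (b, j)
    R_w = R.transpose(1, 0, 2).reshape(n, m * p)    # columns indexed by (a, j)
    R_c = R.transpose(2, 0, 1).reshape(p, m * n)    # columns indexed by (a, b)
    for _ in range(n_iter):
        Q = R_q @ np.linalg.pinv(khatri_rao(W, C)).T
        W = R_w @ np.linalg.pinv(khatri_rao(Q, C)).T
        C = R_c @ np.linalg.pinv(khatri_rao(Q, W)).T
    return C, Q, W

# Synthetic, noise-free example with k = 2 components (hypothetical values).
Q_true = np.array([[1, 0], [2, 0], [1, 1], [0, 2], [0, 1]], dtype=float)   # m = 5
W_true = np.array([[1, 0], [0, 1], [1, 1], [2, 1]], dtype=float)           # n = 4
C_true = np.array([[1, 2], [2, 1], [3, 0.5]], dtype=float)                 # p = 3
R = np.einsum('ai,bi,ji->abj', Q_true, W_true, C_true)

C, Q, W = parafac_als(R, k=2)
R_hat = np.einsum('ai,bi,ji->abj', Q, W, C)
E = R - R_hat   # residual data array
```

The recovered columns are determined only up to ordering and scaling, so in practice each column of Q and W would be normalized and the scale factors collected as the super-diagonal scalars.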

When all proteins are distinct (with differing pI values and differing MW) with expression levels varying in a linearly independent fashion from sample to sample, the following direct interpretations of the results can be expected:

1. The k value from the above decomposition will automatically be equal to the number of proteins.
2. Values in each row of matrix C, after scaling with the super-diagonal elements in I, represent the relative concentrations of these proteins in a particular sample.
3. Each column in matrix Q represents the deconvolved pI profile of a particular protein.
4. Each column in matrix W represents the deconvolved MW profile of a particular protein.

If these proteins are distinct but with correlated expression levels from sample to sample (matrix C with linearly dependent columns), the interpretation can only be performed on the group of proteins having correlated expression levels, not on each individual protein, a finding of significance for proteomics research.

Based on the decomposition presented above, the power of such a multidimensional system and analysis can be immediately seen:

a. As a result of this decomposition, which separates the composite responses into linear combinations of individual protein responses in each dimension, quantitative information can be obtained for each protein in the presence of all other proteins.
b. The decomposition also separates out the profiles for each individual protein in each dimension, providing qualitative information for the identification of these proteins in both dimensions (pI and MW in 2DE and the chromatographic and the mass spectral dimension in LC/MS).
c. Each sample in the 3D data array R can contain a different set of proteins, implying that the proteins of interest can be identified and quantified in the presence of unknown proteins, with only the common proteins shared by all samples in the data array having all nonzero concentrations in the decomposed matrix C.
d. A minimum of only two distinct samples will be required for this analysis, providing for a much better way to perform differential proteomic analysis without labels such as in ICAT to quickly and reliably pick out the proteins of interest in the presence of other un-interesting proteins.
e. The number of analytes that can be analyzed is limited by the maximum allowable pseudo-rank for each response matrix R_(j), which can easily reach thousands (ion trap MS) to hundreds of thousands (TOF or FTICR-MS), paving the way for large scale proteomic analysis on complex biological samples.
f. A typical LC/MS run can be completed in less than 2 hours with no other chemical processes or sample preparation steps involved, pointing to at least a 10-fold gain in throughput and tremendous simplification in informatics.
g. Since full LC/MS data are used in the analysis, nearly 100% sequence coverage can be achieved without the MS/MS experiments.

An important advantage of the above analysis, based on an image of the 2-D gel separation, is that it is non-destructive and one can follow up with further confirmation through the use of, for example, MALDI TOF.

The above analysis can also be applied to protein digests where all peptides from the same protein can be treated as a distinct group for analysis and interpretation. The separation of pI and MW profiles into individual proteins can still be performed when separation into individual peptides is not feasible.

Left and right transformation matrices A_(j) and B_(j) can preferably be determined using internal standards added to each sample. These internal standards are selected to cover all pI and MW ranges, for example, five internal standards with one on each corner of the 2D gel image and one right in the center. The concentrations of these internal standards would vary from one sample to another so that the corresponding matrix C in the above decomposition can be partitioned as

C = [C_(s) | C_(unk)]

where all columns in C_(s) are independent, i.e., C_(s) is full rank, or better yet, the ratio between the largest and the smallest singular value is minimized. Now with part of the matrix C known in the above decomposition, it is possible to perform the decomposition such that the transformation matrices A_(j) and B_(j) for each sample (j=1, 2, . . . p) can be determined in the same decomposition process to minimize the overall residual E. The scale of the problem can be drastically reduced by parameterizing the nonzero diagonal bands in A_(j) and B_(j), for example, by specifying a band-broadening filter of Gaussian shape for each row in A_(j) and each column in B_(j) and allowing for smooth variation of the Gaussian parameters down the rows in A_(j) and across the columns in B_(j). With matrices A_(j) and B_(j) properly parameterized and analytical forms of derivatives with respect to the parameters derived, an efficient Gauss-Newton iteration approach can be applied to the trilinear decomposition or PARAFAC algorithm to arrive at both the desired decomposition and the proper transformation matrices A_(j) and B_(j) for each sample.
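To make the parameterization concrete, the sketch below builds a banded matrix whose rows are Gaussian filters with centers and widths that vary smoothly (here linearly) down the rows. The function name `gaussian_band_matrix`, the linear variation, the band half-width and the numerical values are all hypothetical; the Gauss-Newton fitting of these parameters is not shown.

```python
import numpy as np

def gaussian_band_matrix(size, centers, widths, half_band=3):
    """Row r holds a normalized Gaussian centered at centers[r] with
    standard deviation widths[r]; entries outside the band are zero."""
    A = np.zeros((size, size))
    cols = np.arange(size)
    for r in range(size):
        mid = int(round(centers[r]))
        lo = max(0, mid - half_band)
        hi = min(size, mid + half_band + 1)
        w = np.exp(-0.5 * ((cols[lo:hi] - centers[r]) / widths[r]) ** 2)
        A[r, lo:hi] = w / w.sum()
    return A

m = 20
rows = np.arange(m)
# Four scalars (two intercepts, two slopes) describe the whole m x m matrix.
centers = rows + 0.05 * rows      # band shift growing smoothly down the rows
widths = 0.6 + 0.02 * rows        # band broadening growing smoothly
A = gaussian_band_matrix(m, centers, widths)

# Applying the left transformation to a raw image (B taken as identity here).
R_raw = np.random.default_rng(1).random((m, 8))
R_std = A @ R_raw
```

With this kind of parameterization the unknowns per sample drop from the order of m times the bandwidth to a handful of scalars, which is the reduction in problem scale referred to above.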

Compared with ICAT (isotope-coded affinity tags, Gygi, S. P. et al, Nature Biotech. 1999, 17, 994), this approach is not limited to analyzing only two samples and does not require peptide sequencing for protein identifications. The number of samples that can be quantified can be in the hundreds to thousands or even tens of thousands and the protein identification can be accomplished through the mass spectral data alone once all these proteins have been mathematically resolved and separated. Furthermore, there is no additional chemistry involving isotope labels, which should reduce the risk of losing many important proteins during the tedious sample preparation stages required for ICAT.

In brief, the present invention, using the method of analysis described above, provides a technique for protein identification and protein expression analysis using 2D data having the following features:

-   2D gel data from multiple samples is used to form a 3D data array;
-   for each of the following scenarios there will be a different set of interpretations applicable:
    a) where all proteins are distinct with expression levels varying independently from sample to sample,
    b) where all proteins are distinct with correlated expression levels from sample to sample;
-   centroiding on mass spectral continuum data is avoided;
-   raw mass spectral data alone can be directly utilized and is sufficient as input into the data array decomposition;
-   full mass spectral calibration, as for example that performed in U.S. patent application Ser. No. 10/689,313, may be optionally performed on the raw continuum data to obtain fully calibrated continuum data as inputs to the analysis, allowing for even more accurate mass determination and library search for the purpose of protein identification once a deconvolved mass spectrum becomes available for an individual protein after the array decomposition;
-   this approach is based on mathematics instead of physical sequencing to resolve and separate proteins and does not require peptide sequencing for protein identifications;
-   the results are both qualitative and quantitative;
-   gel spot alignment and matching is automatically built into the data analysis.

Furthermore, it is preferred to have fully calibrated continuum mass spectral data in this invention to further improve mass alignment and spectral peak shape consistency, as described in co-pending application Ser. No. 10/689,313, a brief summary of which is set forth below.

Producing Fully Calibrated Continuum Mass Spectral Data

A calibration relationship of the form:

m = f(m₀)  (Equation A)

can be established through a least-squares polynomial fit between the centroids measured and the centroids calculated using all clearly identifiable isotope clusters available in the mass spectral standard across the mass range.

In addition to this simple mass calibration, additional full spectral calibration filters are calculated to serve two purposes simultaneously: the calibration of mass spectral peak shapes and mass spectral peak locations. Since the mass axis may have been pre-calibrated, the mass calibration part of the filter function is reduced in this case to achieve a further refinement on mass calibration, i.e., to account for any residual mass errors after the polynomial fit given by Equation A.

This total calibration process applies easily to quadrupole-type MS including ion traps where mass spectral peak width (Full Width at Half Maximum or FWHM) is generally roughly consistent within the operating mass range. For other types of mass spectrometer systems such as magnetic sectors, TOF, or FTMS, the mass spectral peak shape is expected to vary with mass in a relationship dictated by the operating principle and/or the particular instrument design. While the same mass-dependent calibration procedure is still applicable, one may prefer to perform the total calibration in a transformed data space consistent with a given relationship between the peak width/location and mass.

In the case of TOF, it is known that the mass spectral peak width (FWHM) Δm is related to the mass (m) in the following relationship:

$\Delta m = a\sqrt{m}$

where a is a known calibration coefficient. In other words, the peak width measured across the mass range would increase with the square root of the mass. A square root transformation converts the mass axis into a new function as follows:

$m' = \sqrt{m}$

where the peak width (FWHM) as measured in the transformed mass axis is given by

$\frac{\Delta m}{2\sqrt{m}} = \frac{a}{2}$

which will remain unchanged throughout the spectral range.

For an FTMS instrument, on the other hand, the peak width (FWHM) Δm will be directly proportional to the mass m, and therefore a logarithm transformation will be needed:

$m' = \ln(m)$

where the peak width (FWHM) as measured in the transformed log-space is given by

$\ln\left(\frac{m + \Delta m}{m}\right) = \ln\left(1 + \frac{\Delta m}{m}\right) \approx \frac{\Delta m}{m}$

which will be fixed independent of the mass. Typically in FTMS, Δm/m can be managed on the order of 10⁻⁵, i.e., 10⁵ in terms of the resolving power m/Δm.
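The effect of the two transformations can be checked numerically. The short sketch below measures the width of a peak, taken symmetrically about m, on the transformed axis at several masses. The coefficient values and function names are hypothetical; only the functional forms Δm = a√m (TOF) and Δm proportional to m (FTMS) come from the text.

```python
import math

def tof_width_sqrt_axis(m, a):
    """Width on the sqrt(m) axis of a TOF peak with FWHM a*sqrt(m) centered at m."""
    dm = a * math.sqrt(m)
    return math.sqrt(m + dm / 2) - math.sqrt(m - dm / 2)

def ftms_width_log_axis(m, b):
    """Width on the ln(m) axis of an FTMS peak with FWHM b*m centered at m."""
    dm = b * m
    return math.log(m + dm / 2) - math.log(m - dm / 2)

a, b = 0.01, 1e-5                      # hypothetical coefficients
masses = (100.0, 1000.0, 10000.0)
tof_widths = [tof_width_sqrt_axis(m, a) for m in masses]    # each close to a / 2
ftms_widths = [ftms_width_log_axis(m, b) for m in masses]   # each close to b
```

Because the transformed widths no longer depend on mass, a single peak shape, and hence a single calibration filter, can serve the whole transformed axis.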

For a magnetic sector instrument, depending on the specific design, the spectral peak width and the mass sampling interval usually follow a known mathematical relationship with mass, which may lend itself to a particular form of transformation through which the expected mass spectral peak width would become independent of mass, much like the way the square root and logarithm transformations do for the TOF and FTMS.

When the expected mass spectral peak width becomes independent of the mass, due either to the appropriate transformation such as logarithmic transformation on FTMS and square root transformation on TOF-MS or the intrinsic nature of a particular instrument such as a well designed and properly tuned quadrupole or ion trap MS, huge savings in computational time will be achieved with a single calibration filter applicable to the full mass spectral range. This would also simplify the requirement on the mass spectral calibration standard: a single mass spectral peak would be required for the calibration with additional peak(s) (if present) serving as check or confirmation only, paving the way for complete mass spectral calibration of each and every MS based on an internal standard added to each sample to be measured.

There are usually two steps in achieving total mass spectral calibration. The first step is to derive the actual mass spectral peak shape functions and the second step is to convert the derived actual peak shape functions into specified target peak shape functions centered at the correct mass locations. An internal or external standard with its measured raw mass spectral continuum y₀ is related to the isotope distribution y of a standard ion or ion fragment by

y₀ = y ⊗ p

where ⊗ denotes convolution and p is the actual peak shape function to be calculated. This actual peak shape function is then converted to a specified target peak shape function t (a Gaussian of certain FWHM, for example) through one or more calibration filters f given by

t = p ⊗ f
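A minimal numerical sketch of the two steps follows, assuming that both relations above are discrete convolutions on a uniformly sampled axis and that each step is solved by ordinary least squares. The grid, the synthetic isotope sticks, the peak widths and the helper names `conv_matrix` and `gaussian` are hypothetical, and least squares is used here only as one simple way to carry out the deconvolution.

```python
import numpy as np

def conv_matrix(v, n_cols):
    """Matrix M such that M @ x == np.convolve(v, x) for x of length n_cols."""
    M = np.zeros((len(v) + n_cols - 1, n_cols))
    for j in range(n_cols):
        M[j:j + len(v), j] = v
    return M

def gaussian(x, fwhm):
    """Unit-area sampled Gaussian with the given full width at half maximum."""
    sigma = fwhm / (2.0 * np.sqrt(2.0 * np.log(2.0)))
    g = np.exp(-0.5 * (x / sigma) ** 2)
    return g / g.sum()

x = np.arange(-10, 11)                     # 21-point grid (hypothetical spacing)
p_true = gaussian(x, 6.0)                  # actual peak shape, unknown in practice
y = np.zeros(41)
y[[10, 18, 26]] = [1.0, 0.6, 0.2]          # synthetic isotope distribution
y0 = np.convolve(y, p_true)                # measured continuum: y0 = y (conv) p

# Step 1: recover the actual peak shape p from y0 and y.
p_est, *_ = np.linalg.lstsq(conv_matrix(y, len(x)), y0, rcond=None)

# Step 2: find the filter f that turns p into the target shape t = p (conv) f.
t = gaussian(np.arange(-20, 21), 8.0)      # target: Gaussian with a wider FWHM
f, *_ = np.linalg.lstsq(conv_matrix(p_est, len(x)), t, rcond=None)

calibrated = np.convolve(y0, f)            # continuum with the target peak shape
```

In this example the target is chosen wider than the actual peak so that the filter stays well behaved; a narrower target would amplify noise.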

The calibration filters calculated above can be arranged into the following banded diagonal filter matrix:

$F = \begin{bmatrix}f_{1} & \; & \; & \; & \; \\\; & \ldots & \; & \; & \; \\\; & \; & f_{i} & \; & \; \\\; & \; & \; & \ldots & \; \\\; & \; & \; & \; & f_{n}\end{bmatrix}$

in which each short column vector on the diagonal, f_(i), is taken from the convolution filter calculated above for the corresponding center mass. The elements in f_(i) are taken from the elements of the convolution filter in reverse order, i.e.,

$f_{i} = \begin{bmatrix}f_{i,m} \\f_{i,{m - 1}} \\\ldots \\\ldots \\\ldots \\f_{i,1}\end{bmatrix}$

As an example, this calibration matrix will have a dimension of 8,000 by 8,000 for a quadrupole MS with mass coverage up to 1,000 amu at ⅛ amu data spacing. Due to its sparse nature, however, the typical storage requirement would only be around 40 by 8,000 with an effective filter length of 40 elements covering a 5-amu mass range.
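The compact storage can be illustrated by keeping only the nonzero band, one short reversed filter per data point, and applying it without ever forming the full square matrix. In the sketch below the array layout (`band[:, i]` is the reversed filter for output point i), the zero padding at the ends and the odd filter length are assumptions made for the example.

```python
import numpy as np

def apply_banded_filters(band, y):
    """Apply a banded filter matrix stored compactly.

    band[:, i] holds the reversed convolution filter used for output point i,
    so output i is the dot product of band[:, i] with the input window
    centered on point i (zero-padded at both ends).
    """
    length, n = band.shape
    half = length // 2
    padded = np.pad(y, (half, length - 1 - half))
    out = np.empty(n)
    for i in range(n):
        out[i] = band[:, i] @ padded[i:i + length]
    return out

# Small check: with the same filter at every mass, the banded product
# reduces to an ordinary convolution.
rng = np.random.default_rng(2)
n = 50
y = rng.random(n)
f = np.array([0.05, 0.2, 0.4, 0.25, 0.1])            # asymmetric test filter
band = np.tile(f[::-1][:, None], (1, n))              # reversed filter per column
filtered = apply_banded_filters(band, y)
```

For 8,000 data points and a filter of about 40 elements, this layout stores roughly 40 by 8,000 values in place of the 8,000 by 8,000 dense matrix, in line with the estimate above.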

Returning to the present invention, further multivariate statistical analysis can be applied to matrix C to study and understand the relationships between different samples and different proteins. The samples and proteins can be grouped or cluster-analyzed to see which proteins are expressed more within which sample groups. For example, a dendrogram can be created using the scores or loadings from the principal component analysis of the C matrix. Typical conclusions include that cell samples from healthy individuals cluster around each other while those from diseased individuals cluster in a different group. For samples collected over a period of time after certain treatment, the samples may show a continuous change in the expression levels of some proteins, indicating a biological reaction to the treatment on the protein level. For samples collected over a series of dosages, the changes in relevant proteins can indicate the effects of dosages on this set of proteins and their potential regulations.

In the case where proteins are pre-digested into peptides before the analysis, each column in matrix C would represent a linear combination of a group of peptides coming from the same protein or a group of proteins showing similar expression patterns from sample to sample. A dendrogram constructed to classify columns in matrix C, such as the one shown in FIG. 11, would group individual peptides back into their respective proteins and thus accomplish the analysis on the proteome level.

Qualitative (or signatory) information for the proteins identified can be found in pI profile matrix Q and MW matrix W. The qualitative information can serve the purpose of protein identification and even library searching, especially if the molecular weight information is determined with sufficient accuracy. In summary, the three matrices C, Q, and W, when combined, allow for both protein quantification and identification with automatic gel matching and spot alignment from the determination of transformation matrices represented by A_(j) and B_(j).

The above 2-D data can come in different forms and shapes. An alternative to MALDI-TOF after excising/digesting 2-D gel spots is to run these samples through conventional LC/MS, for example on the Thermo Finnigan LCQ system, to further separate proteins from each gel spot before MS analysis. A very important application of this approach allows for rapid and direct protein identification and quantitation by avoiding 2-D gel (2DE) separation altogether, thus increasing the throughput by orders of magnitude. This can be accomplished through the following steps:

1. Directly digest the sample containing hundreds to tens of thousands of proteins without any separation.
2. Run the digested sample on a conventional LC/MS instrument to obtain a two-dimensional array. It should be noted that MS/MS capability is not a requirement in this case, although one may choose to run the sample on an LC/MS/MS system, which generates additional sequencing information.
3. Repeat 1 and 2 for multiple samples to generate a three-dimensional data array.
4. Decompose the data array using the approach outlined above.
5. Replace the pI axis with LC retention time and the MW axis with the mass axis in interpretation and mass spectral searching for the purpose of protein identification. The mathematically separated mass spectra can be further processed through centroiding and de-isotoping to yield stick spectra consistent with conventional databases and search engines such as Mascot or SwissProt, available online from: http://www.matrixscience.com or from http://us.expasy.org/sprot/. It is preferable, however, to fully calibrate the raw mass spectral continuum data into calibrated continuum data prior to the data array decomposition to yield fully calibrated continuum mass spectral data for each deconvolved protein or peptide. This continuum mass spectral data would then be used along with its high mass accuracy without centroiding for protein identification through a novel database search disclosed in a co-pending patent application.

Depending on the nature of the LC column, the LC can act as another form of charge separation, similar to the pI axis in 2-D gel. The mass spectrometer in this case serves as a precise means for molecular weight measurement, similar to the MW axis in 2-D gel analysis. Due to the high mass accuracy available on a mass spectrometer, the transformation matrix B_(j) can be reduced to a diagonal matrix to correct for mass-dependent ionization efficiency changes or even an identity matrix to be dropped out of the equation, especially after the full mass spectral calibration mentioned above. In order to handle large protein molecules, the protein sample is typically pre-digested into peptides through the use of enzymatic or chemical reactions, for example, with trypsin. Therefore, it is typical to see multiple LC peaks as well as multiple masses for each protein of interest. While this may add complexities for sample handling, it largely enhances the selectivity of library search and protein identification. Multiple digestions may be used to further enhance the selectivity. Taking this to the extreme, each protein may be digested into peptides of varying lengths beforehand (Edman degradation) to yield complete protein sequence information from matrix W. This is a new technique for protein sequencing based on mathematics rather than physical sequencing as an alternative to LC tandem mass spectrometry. In applications including MS, the approach does not require any data preprocessing on the continuum data from mass scans, such as the centroiding and de-isotoping typically done in commercial instrumentation, which are prone to many unsystematic errors. The raw counts data can be supplied and directly utilized as inputs into the data array decomposition.

Other 2-D data that can yield similar results with identical approaches include but are not limited to the following examples that have 2-D separation with single point detection, or 1-D separation with multi-channel detection, or 2-D multi-channel detection:

1. Each 1-D or 2-D gel spot can be treated as an independent sample for the subsequent LC/MS analysis to generate one LC/MS 2-D data array for each spot and a data array containing all gel spots and their LC/MS data arrays. Due to the added resolving power gained from both gel and LC separation, more proteins can be more accurately identified.
2. Other types of 2-D separation, such as pI/hydrophobicity, MW/hydrophobicity, or a 1-D separation using either pI, MW, or hydrophobicity and a form of multi-channel electromagnetic or mass spectral detection, such as 1-D gel combined with on-the-gel MALDI TOF, or LC/TOF, LC/UV, LC/Fluorescence, etc. can be used.
3. Other types of 2-D separations such as 2-D liquid chromatography, with a single-channel detection (UV at 245 nm or fluorescence-tagged to be measured at one wavelength) can be used.
4. 1D or 2D protein arrays coupled with mass spectral or other multi-channel detection where each element on the array captures a particular combination of proteins in a way not dissimilar to LC columns can be used. These 1D or 2D spots can be arranged into one dimension of the 2-D array with the other dimension being mass spectrometry. These protein spots are similar to sensor arrays such as Surface Acoustic Wave Sensors (SAWs, coated with GC column materials to selectively bind to a certain class of compounds) or electronic noses such as conductive polymer arrays on which a binding event would generate a distinct electrical signal.
5. Multi-wavelength emission and excitation fluorescence (EEM) on a single sample with different proteins tagged differentially or specific to a segment of the protein sequence can be used.

In second order proteomics analysis, the data array is formed by the 2D response matrices from multiple samples. Another effective way to create a data array is to include one more dimension in the measurement itself such that a data array can be generated from a single sample on what is called a third order instrument. One such instrument starting to receive wide attention in proteomics is LC/LC/MS, amenable to the same decomposition to yield mathematically separated elution profiles in both LC dimensions and MS spectral responses for each protein present in the sample.

Thus, while the two-dimensional approaches outlined above are major improvements in the art, a three-dimensional approach has the advantages of being much faster, more reproducible, and simpler, since the sample stays in the liquid phase throughout the entire process. However, since many proteins are too large for conventional mass spectrometers, and all proteins in the sample may be digested into peptide fragments before LC separation and mass spectral detection, the number of peptides and the complexity of the system increase by at least one order of magnitude. This results in what appears to be an insurmountable problem for data handling and data interpretation. In addition, available approaches stop short at only the level of qualitative protein identification for samples of very limited complexity such as yeast (Washburn, M. P. et al, Nat. Biotechnol. 19, 242-247 (2001)). The approach presented below achieves both identification and quantification of anywhere from hundreds up to tens of thousands of proteins in a single two-dimensional liquid chromatography-mass spectrometry (LC/LC/MS or 2D-LC/MS) run.

By way of example, either size exclusion and reversed phase liquid chromatography (SEC-RPLC) or strong cation exchange and reversed phase liquid chromatography (SCX-RPLC) can be used for initial separation. This is followed by mass spectrometry detection (MS) in the form of either electro-spray ionization (ESI) mass spectrometry or time-of-flight mass spectrometry. The set of data generated is arranged into a three dimensional data array, R, that contains mass intensity (count) data at different combinations of retention times (t₁ and t₂, corresponding to the retention times in each LC dimension, for example, SEC and RPLC retention times, digitized at m and n different time points) and masses (digitized at p different values covering the mass range of interest). A graphical representation of this data array is provided in FIG. 6.

It is important to note that while the mass spectral data can be preprocessed into stick spectral form through centroiding and de-isotoping, this is not required for the approach to work. Raw mass spectral continuum data can work better, due to the preservation of spectral peak shape information throughout the analysis and the elimination of all types of centroiding and de-isotoping errors mentioned above. A preferable approach is to fully calibrate the continuum raw mass spectral data into calibrated continuum data to achieve high mass accuracy and allow for a more accurate library search.

At each retention time combination of t₁ and t₂ in data array R (dimensioned as m by n by p), the fraction of the sample injected into the mass spectrometer is composed of some linear combination of a subset of the peptides in the original sample. This fraction of the sample is likely to contain somewhere between a few peptides and a few tens of thousands of peptides. The mass spectrum corresponding to such a sample fraction is likely to be very complex and, as noted above, the challenges of resolving such a mix into individual proteins for protein identification and especially quantification would seem to be insurmountable.

However, the three-dimensional data array, as noted above with respect to two-dimensional analysis, can be decomposed with a trilinear decomposition method based on GRAM (Generalized Rank Annihilation Method, direct decomposition through matrix operation without iteration) or PARAFAC (PARAllel FACtor analysis, iterative decomposition with alternating least squares) into four different matrices and a residual data cube E as noted above.

In this three-dimensional analysis, C represents the chromatograms with respect to t₁ of all identifiable peptides (k of them with k≦min(m,n)), Q represents the chromatograms with respect to t₂ of all identifiable peptides (k of them), W represents the deconvolved continuum mass spectra of all peptides (k of them), and I is a new data array with scalars on its super-diagonal as the only nonzero elements. In other words, through the decomposition of this data array, the two retention times (t₁ and t₂) have been identified for each and every peptide existing in the sample, along with precise determination of the mass spectral continuum for each peptide contained in W.

The foregoing analysis yields information on the peptide level, unless intact proteins are directly analyzed without digestion and with a mass spectrometer capable of handling larger masses. The protein level information, however, can be obtained from multiple samples through the following additional steps:

1. Perform the 2D-LC/MS runs as described above for multiple samples (l of them) collected over a period of time with the same treatment, or at a fixed time with different dosages of treatment, or from multiple individuals at different disease states.
2. Perform the data decomposition for each sample as described above and fully identify all the peptides within each sample.
3. The relative concentrations of all peptides in each sample can be read directly from the super-diagonal elements in I. A new matrix S composed of these concentrations across all samples can be formed with dimensions of l samples by q distinct peptides in all samples (q≧max(k₁, k₂, . . . , k_(l)) where k_(i) is the number of peptides in sample i (i=1, 2, . . . , l)). For samples that do not contain some of the peptides existing in other samples, the entries in the corresponding rows for these peptides (arranged in columns) would be zeros.
4. A statistical study of the matrix S will allow for examination of the peptides that change in proportion to each other from one sample to another. These peptides could potentially correspond to all the peptides coming from the same protein. A dendrogram based on Mahalanobis distance calculated from singular value decomposition (SVD) or principal component analysis (PCA) of the S matrix can indicate the inter-connectedness of these peptides. It should be pointed out, however, that there would be groups of proteins that vary in tandem from one sample to another and thus all their corresponding peptides would be grouped into the same cluster. A graphical representation of this process is provided in FIG. 11.
5. The matrix S so partitioned according to the grouping above represents the results of differential proteomics analysis showing the different protein expression levels across many samples.
6. For all peptides in each group identified in step 5 immediately above, the resolved mass spectral responses contained in W are combined to form a composite mass spectral signature of all peptides contained in each protein or group of proteins that change in tandem in their expression levels. Such a composite mass spectrum can be either further processed into a stick/centroid spectrum (if not so processed already) or preferably searched directly against standard protein databases such as Mascot and SwissProt for protein identification using continuum mass spectral data as disclosed in the co-pending application.
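The grouping of peptides that change in proportion to each other can be illustrated with a deliberately simplified stand-in for the dendrogram: instead of a Mahalanobis distance computed from an SVD or PCA of S, the sketch below groups columns of S by a plain correlation threshold. The function name `group_columns`, the threshold and the synthetic expression levels are hypothetical, and columns with no variation across samples would need separate handling.

```python
import numpy as np

def group_columns(S, threshold=0.99):
    """Greedily group columns of S (samples x peptides) whose profiles
    across samples are strongly correlated; returns one label per column."""
    corr = np.corrcoef(S, rowvar=False)
    q = S.shape[1]
    labels = [-1] * q
    next_label = 0
    for i in range(q):
        if labels[i] == -1:
            labels[i] = next_label
            next_label += 1
        for j in range(i + 1, q):
            if labels[j] == -1 and corr[i, j] >= threshold:
                labels[j] = labels[i]
    return labels

# Two hypothetical proteins measured across five samples ...
protein_a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
protein_b = np.array([5.0, 1.0, 4.0, 2.0, 3.0])
# ... each giving peptides whose concentrations are proportional to it.
S = np.column_stack([2.0 * protein_a, 0.5 * protein_a, 3.0 * protein_b,
                     1.5 * protein_a, 1.0 * protein_b])

labels = group_columns(S)
groups = {g: [i for i, lab in enumerate(labels) if lab == g] for g in set(labels)}
```

The composite signature of step 6 could then be formed by combining the columns of W that carry the same label, for example by summing them.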

Compared with ICAT (Gygi, S. P. et al., Nat. Biotechnol. 17, 994-999 (1999)), the quantitation proposed here does not require any additional sample preparation, has the potential of handling many thousands of samples, and uses all available peptides (instead of a few available for isotope-tagging) in an overall least squares fit to arrive at relative protein expression levels. Also, because all peptides are mathematically isolated and later grouped back into proteins, protein identification can be accomplished without the peptide sequencing that ICAT requires. In the case of intact protein 2D-LC/MS analysis, all protein concentrations can be read directly off the super-diagonal in I, without any further re-grouping. It may, however, still be desirable to form the S matrix as above and perform statistical analysis on the matrix for the purpose of differential proteomics or protein expression analysis.
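To make the role of the super-diagonal concrete, the decomposition can be written element by element. The following is a sketch of a trilinear model that is consistent with the dimensions this document gives for Q (m×k), W (n×k), C (p×k), I (k×k×k) and E (m×n×p); the index names i, j, l and f are chosen here for illustration, and the exact notation of the original equation is an assumption.

```latex
\[
  r_{ijl} \;=\; \sum_{f=1}^{k} I_{fff}\, q_{if}\, w_{jf}\, c_{lf} \;+\; e_{ijl},
  \qquad i = 1,\dots,m,\quad j = 1,\dots,n,\quad l = 1,\dots,p .
\]
```

Under this form, each component f contributes the product of its three pure profiles scaled by the single scalar I_fff, which is why the relative concentration of component f can be read from the f-th super-diagonal element of I.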

In brief, the present invention provides a method for protein identification and protein expression analysis using three dimensional data having the following features:

-   the set of data generated from either of the following methodologies is arranged into a 3D data array:
    -   a) size exclusion and reversed phase liquid chromatography (SEC-RPLC), or
    -   b) strong cation exchange and reversed phase liquid chromatography (SCX-RPLC), coupled with either:
    -   i) electro-spray ionization (ESI) mass spectrometry for peptides after protein digestion, or
    -   ii) time-of-flight (TOF) mass spectrometry for peptides or intact proteins;
-   here, the mass spectral data does not have to be preprocessed through centroiding and/or de-isotoping, though it is preferred to fully calibrate the raw mass spectral continuum;
-   mass spectral continuum data can be used directly and is in fact preferred, thus preserving spectral peak shape information throughout the analysis;
-   this approach is a method of mathematical isolation of all peptides and then later grouping back into proteins, thus the protein identification can be done without peptide sequencing;
-   the present invention provides a quantitative tool that does not require any additional sample preparation, has the potential of handling many thousands of samples, and uses all available peptides in an overall least squares fit to arrive at relative protein expression levels.
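The arrangement of the measured data into a 3D data array can be illustrated with a short sketch. It assumes, purely for illustration, that each level of the stack is one LC/MS run (retention time by m/z channel) recorded for one first-dimension fraction; the function names and the numbers are hypothetical.

```python
def form_data_array(levels):
    """Compile a stack of equally sized 2D levels into one 3D array R[i][j][l].

    i indexes the level (assumed here: first-dimension fraction),
    j the second separation variable (assumed: retention time),
    l the detector channel (assumed: m/z).
    """
    if not levels or not levels[0]:
        raise ValueError("the data stack must contain at least one non-empty level")
    n, p = len(levels[0]), len(levels[0][0])
    for level in levels:
        if len(level) != n or any(len(row) != p for row in level):
            raise ValueError("all levels must share the same n x p shape")
    # Copy the rows so the array does not alias the caller's lists.
    return [[list(row) for row in level] for level in levels]


def channel_slice(R, l):
    """Return the m x n matrix recorded by detector channel l."""
    return [[row[l] for row in level] for level in R]


# Two fractions, two retention times, three m/z channels (made-up intensities).
levels = [
    [[0.0, 1.0, 0.0], [0.0, 2.0, 0.0]],
    [[3.0, 0.0, 0.0], [1.0, 0.0, 4.0]],
]
R = form_data_array(levels)
shape = (len(R), len(R[0]), len(R[0][0]))
mz1 = channel_slice(R, 1)
print(shape, mz1)
```

The same array can be viewed level by level or channel by channel; which axis is treated as the stacking direction is a bookkeeping choice and does not change the data.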

The above 3D data can come in different forms and shapes. An alternative to 2D-LC/MS is to perform 2D electrophoresis separation coupled with electrospray ionization (ESI) mass spectrometry (conventional ion-trap or quadrupole-MS or TOF-MS). The analytical approach and process are identical to those described above. Other types of 3D data amenable to this approach include but are not limited to:

2D-LC with other multi-channel spectral detection by UV, fluorescence (with sequence-specific tags or tags whose fluorescence is affected by a segment of the protein sequence), etc.

3D electrophoresis or 3D LC with a single channel detection (UV at 245 nm, for example). The 3D separation can be applied to intact proteins to separate, for example, in pI, MW, and hydrophobicity.

1D electrophoresis followed by 1D-LC/MS on either digested or intact proteins.

2D gel separation followed by MS multi-channel detection. If digestion is needed, it can be accomplished on the gel with the proper MALDI matrix for on-the-gel TOF analysis.

Other 2D means of separation coupled with multi-channel detection.

1D separation coupled with 2D spectral detection, for example, LC/MS/MS.

1D LC or 1D gel electrophoresis coupled with 2D spectral detection, for example, excitation-emission 2D fluorescence (EEM).

The methods of analysis of the present invention can be realized in hardware, software, or a combination of hardware and software. Any kind of computer system, or other apparatus adapted for carrying out the methods and/or functions described herein, is suitable. A typical combination of hardware and software could be a general purpose computer system with a computer program that, when loaded and executed, controls the computer system, which in turn controls an analysis system, such that the system carries out the methods described herein. The present invention can also be embedded in a computer program product, which comprises all the features enabling the implementation of the methods described herein, and which, when loaded in a computer system (which in turn controls an analysis system), is able to carry out these methods.

Computer program means or computer program in the present context include any expression, in any language, code or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or after conversion to another language, code or notation, and/or reproduction in a different material form.

Thus the invention includes an article of manufacture which comprises a computer usable medium having computer readable program code means embodied therein for causing a function described above. The computer readable program code means in the article of manufacture comprises computer readable program code means for causing a computer to effect the steps of a method of this invention. Similarly, the present invention may be implemented as a computer program product comprising a computer usable medium having computer readable program code means embodied therein for causing a function described above. The computer readable program code means in the computer program product comprises computer readable program code means for causing a computer to effect one or more functions of this invention. Furthermore, the present invention may be implemented as a program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for causing one or more functions of this invention.

It is noted that the foregoing has outlined some of the more pertinent objects and embodiments of the present invention. The concepts of this invention may be used for many applications. Thus, although the description is made for particular arrangements and methods, the intent and concept of the invention are suitable and applicable to other arrangements and applications. It will be clear to those skilled in the art that other modifications to the disclosed embodiments can be effected without departing from the spirit and scope of the invention. The described embodiments ought to be construed to be merely illustrative of some of the more prominent features and applications of the invention. Thus, it should be understood that the foregoing description is only illustrative of the invention. Various alternatives and modifications can be devised by those skilled in the art without departing from the invention. Other beneficial results can be realized by applying the disclosed invention in a different manner or modifying the invention in ways known to those familiar with the art. Thus, it should be understood that the embodiments have been provided as an example and not as a limitation. Accordingly, the present invention is intended to embrace all alternatives, modifications and variances which fall within the scope of the appended claims.

1. A method for analyzing data obtained from a sample in a separation system that has a capability for separating components of a sample containing more than one component, said method comprising: separating said sample with respect to at least a first variable to form a separated sample; separating said separated sample with respect to at least a second variable to form a further separated sample; obtaining data representative of said further separated sample from a multi-channel analyzer, said data being expressed as a function of three variables; forming a data stack having successive levels, each level containing data from one channel of said multi-channel analyzer; forming a data array representative of a compilation of all of the data in said data stack; and separating said data array into a series of matrixes or arrays, said matrixes or arrays being: a concentration data array representative of concentration of each component in said sample on its super-diagonal; a first profile of each component as a function of a first variable; a second profile of each component as a function of a second variable; and a third profile of each component as a function of a third variable.
2. The method of claim 1, wherein said first profile, said second profile, and said third profile are representative of profiles of substantially pure components.
3. The method of claim 1, further comprising performing qualitative analysis using at least one of said first profile, said second profile, and said third profile.
4. The method of claim 1, further comprising standardizing data representative of a sample by performing a data matrix multiplication of such data into the product of a first standardization matrix, the data itself, and a second standardization matrix, to form a standardized data matrix.
5. The method of claim 4, wherein terms in said first standardization matrix and said second standardization matrix have values that cause said data to be represented at positions with respect to two of said three variables, which are different in said standardized data matrix from those in said data array.
6. The method of claim 5, wherein said first standardization matrix shifts said data with respect to one of said two variables, and said second standardization matrix shifts said data with respect to the other of said two variables.
7. The method of claim 5, wherein terms in said first standardization matrix and said second standardization matrix have values that serve to standardize distribution shapes of the data with respect to said two variables, respectively.
8. The method of claim 4, wherein terms in said first standardization matrix and said second standardization matrix are determined by: applying a sample having known components to said apparatus; and selecting terms for said first standardization matrix and said second standardization matrix which cause data produced by said known components to be positioned properly with respect to the two variables.
9. The method of claim 8, wherein said terms are determined by selecting terms which produce a smallest error in position of said data with respect to the two variables, in said standardized data matrix.
10. The method of claim 9, wherein the terms of said first standardization matrix and said second standardization matrix are computed for a single channel.
11. The method of claim 10, wherein terms of said first standardization matrix and said second standardization matrix are computed so as to produce a smallest error for the channel.
12. The method of claim 4, wherein at least one of the first and second standardization matrices can be simplified to be either a diagonal matrix or an identity matrix.
13. The method of claim 4, wherein the terms in said first standardization matrix and said second standardization matrix are based on parameterized known functional dependence of said terms on said variables.
14. The method of claim 4, wherein values of terms in said first standardization matrix and in said second standardization matrix are determined by solving data array R:

where Q (m×k) contains pure profiles of all k components with respect to the first variable, W (n×k) contains pure profiles with respect to the second variable for the components, C (p×k) contains pure profiles of these components with respect to the multichannel analyzer or the third variable, I (k×k×k) is a new data array with scalars on its super-diagonal as the only nonzero elements representing the concentrations of all said k components, and E (m×n×p) is a residual data array.
15. The method of claim 1, wherein one of said separation apparatus is a one-dimensional electrophoresis separation system.
16. The method of claim 15, wherein said variable is one of isoelectric point and molecular weight.
17. The method of claim 1, wherein said two separation variables are a result of any combination, in no particular sequence, and including self-combination, of chromatographic separation, capillary electrophoresis separation, gel-based separation, affinity separation and antibody separation.
18. The method of claim 1, wherein one of the three variables is mass associated with the mass axis of a mass spectrometer.
19. The method of claim 18, wherein said apparatus further comprises at least one chromatography system for providing said separated samples to said mass spectrometer, retention time being at least one of the variables.
20. The method of claim 18, wherein said apparatus further comprises at least one electrophoresis separation system for providing said separated samples to said mass spectrometer, migration characteristics of said sample being at least one of the variables.
21. The method of claim 18, wherein said data is continuum mass spectral data.
22. The method of claim 18, wherein said data is used without centroiding.
23. The method of claim 18, further comprising correcting said data for time skew.
24. The method of claim 18, further comprising performing a calibration of said data with respect to mass and spectral peak shapes.
25. The method of claim 18, wherein said apparatus comprises a protein chip having a plurality of protein affinity regions, location of a region being one of said three variables.
26. The method of claim 1, wherein said multichannel analyzer is based on one of light absorption, light emission, light reflection, light transmission, light scattering, refractive index, electrochemistry, conductivity, radioactivity, or any combination thereof.
27. The method of claim 26, wherein the components in said sample are bound to at least one of fluorescence tags, isotope tags, stains, affinity tags, or antibody tags.
28. The method of claim 1, wherein said apparatus comprises a two-dimensional electrophoresis separation system.
29. The method of claim 28, wherein a first of said at least one variable is isoelectric point and a second of said at least one variable is molecular weight.
30. A computer readable medium having thereon computer readable code for use with a chemical analysis system having a data analysis portion for analyzing data obtained from a sample, said chemical analysis system having a separation portion that has a capability for separating components of a sample containing more than one component as a function of at least one variable, said computer readable code being for causing the computer to perform a method comprising: separating said sample with respect to at least a first variable to form a separated sample; separating said separated sample with respect to at least a second variable to form a further separated sample; obtaining data representative of said further separated sample from a multi-channel analyzer, said data being expressed as a function of three variables; forming a data stack having successive levels, each level containing data from one channel of said multi-channel analyzer; forming a data array representative of a compilation of all of the data in said data stack; and separating said data array into a series of matrixes or arrays, said matrixes or arrays being: a concentration data array representative of concentration of each component in said sample on its super-diagonal; a first profile of each component as a function of a first variable; a second profile of each component as a function of a second variable; and a third profile of each component as a function of a third variable.
31. A chemical analysis system for analyzing data obtained from a sample, said system having a separation system that has a capability for separating components of a sample containing more than one component as a function of at least one variable, said system having apparatus for performing a method comprising: separating said sample with respect to at least a first variable to form a separated sample; separating said separated sample with respect to at least a second variable to form a further separated sample; obtaining data representative of said further separated sample from a multi-channel analyzer, said data being expressed as a function of three variables; forming a data stack having successive levels, each level containing data from one channel of said multi-channel analyzer; forming a data array representative of a compilation of all of the data in said data stack; and separating said data array into a series of matrixes or arrays, said matrixes or arrays being: a concentration data array representative of concentration of each component in said sample on its super-diagonal; a first profile of each component as a function of a first variable; a second profile of each component as a function of a second variable; and a third profile of each component as a function of a third variable.